Improve compilation time (reduce from ~50s to ~15s for vLLM) #3145
Conversation
```python
    disable_log_stats=True,
)
if config.max_num_seqs is not None:
    engine_kwargs["max_num_seqs"] = config.max_num_seqs
```
What does this field do?
What happens if we don't set it here?
Commented above: this controls which cudagraphs we capture. Not setting it defaults to the current behavior on main.
I can go ahead and set it by default to avoid any sort of silent slowdown.
Wait, these are two different kwargs; the other one controls vLLM's cudagraph behavior. What is this additional kwarg for?
max_num_seqs also controls other things, like the padding for the maximum batch size (used in the KV cache).
```python
kwargs: dict = dict(cudagraph_mode=self.cudagraph_mode, mode=0)

if max_num_seqs is not None and self.cudagraph_mode != "none":
    kwargs["cudagraph_capture_sizes"] = self._compute_cudagraph_capture_sizes(
```
What happens if we don't set it when cudagraphs are enabled?
It defaults to 256, which captures ~35 different sizes (ranging from 1 to 256), so there is no incorrectness; it just uses more memory and startup time.
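As a rough sketch of what `_compute_cudagraph_capture_sizes` could do here (this is an assumption, not the PR's actual implementation): vLLM's default capture schedule is approximately `[1, 2, 4]` followed by multiples of 8 up to 256, which is the ~35 sizes mentioned above. Capping that list at `max_num_seqs` drops batch sizes that can never occur, saving capture time and memory.

```python
def compute_cudagraph_capture_sizes(max_num_seqs: int) -> list[int]:
    """Hypothetical sketch: cap the default capture schedule at max_num_seqs.

    Default schedule is roughly [1, 2, 4] plus multiples of 8 up to 256
    (~35 sizes total); sizes above max_num_seqs can never be scheduled,
    so capturing graphs for them only wastes memory and startup time.
    """
    default_sizes = [1, 2, 4] + list(range(8, 257, 8))
    return [s for s in default_sizes if s <= max_num_seqs]
```

For example, with `max_num_seqs=8` only four graphs (sizes 1, 2, 4, 8) need to be captured instead of ~35.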
```
require vLLM's whole-model torch.compile to split the graph around
non-capturable ops, which conflicts with per-layer compile.
See https://docs.vllm.ai/en/latest/design/cuda_graphs/#cudagraphmodes
```
What happens when we enable cudagraphs for per-layer compile?
We can save compile time, but what is the impact on runtime, e.g., when going to GB200, where CPU overhead is significant?
The test plan shows the time impact: we observe a speedup over piecewise cudagraphs in this particular setup.
Also, I'd like to check whether it works with EP, which is being enabled in #3142.
MoE has dynamic shapes, despite being full-graph torch-compilable.
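To illustrate why MoE is dynamic-shaped even when it compiles as a full graph: token routing decides how many tokens each expert processes, so the per-expert GEMM shapes change from batch to batch even at a fixed total batch size. A full cudagraph replays fixed shapes, which is exactly what this breaks. The sketch below (pure Python, illustrative only) simulates top-1 routing to show the varying per-expert loads:

```python
import random

def expert_loads(num_tokens: int, num_experts: int, seed: int) -> list[int]:
    """Simulate top-1 MoE routing: each token picks one expert.

    The returned per-expert token counts are the shapes of the expert
    GEMMs; they differ across batches even when num_tokens is fixed,
    which is why a fixed set of captured cudagraph shapes cannot cover
    the MoE layers.
    """
    rng = random.Random(seed)
    loads = [0] * num_experts
    for _ in range(num_tokens):
        loads[rng.randrange(num_experts)] += 1
    return loads
```

Two batches of the same size generally produce different load vectors, so full-graph cudagraph capture would need to handle shapes it never recorded.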
```
require vLLM's whole-model torch.compile to split the graph around
non-capturable ops, which conflicts with per-layer compile.
See https://docs.vllm.ai/en/latest/design/cuda_graphs/#cudagraphmodes
```
Also, it seems we move compile back to torchtitan, but cudagraph application stays in vLLM. How far are we from moving cudagraph application into torchtitan as well?
I would estimate 2-3 weeks, but I can leave a TODO here that we should unify the cudagraph config once we have it on the trainer side.
Fixes #3119 and #3071
Summary
We make significant improvements to vLLM compilation time, saving ~40s (~20s from cudagraphs, ~1s per step, and ~13s from Dynamo) through the following changes:
Test plan
Test results:
Before
After (this PR)