What do the parameters of torch.compile do?
torch.compile is a just-in-time (JIT) compiler that lowers PyTorch code into optimized kernels. It can significantly speed up PyTorch models, but it exposes a few knobs that control its performance trade-offs.
We’ll go over these parameters, what they do, and when to use them.
Compilation Steps
First we’ll outline the steps involved in compilation and define some important pieces of the compiler:
- TorchDynamo: A Python-level compiler that hooks into CPython's frame evaluation to trace Python bytecode into an FX graph. This graph is a lower-level representation of the flow of PyTorch tensor operations.
- AOT Autograd: Generates the backward graph from the forward graph, usually used for the backpropagation of gradients for training.
- TorchInductor: A backend compiler that lowers the computational graph to a further optimized representation, like Triton kernels for CUDA GPUs.
Parameters:
`mode`
- Determines what is being optimized (affects compile time).
- `"default"`: A good balance between compile time, performance, and memory.
- `"reduce-overhead"`: Leverages CUDA graphs to minimize Python overhead at the expense of using a little more memory for caching. This can benefit models where Python operations are the bottleneck, e.g. small models on small batches on big GPUs.
- `"max-autotune"`: Uses a more exhaustive search that profiles multiple implementations of each operation and selects the fastest one. This can add significant time to compilation, so it’s best when that cost can be amortized over subsequent calls.
- `"max-autotune-no-cudagraphs"`: Uses the same autotune search to find the best optimization, but avoids CUDA graph capture. This is a good option if your model needs to accept varying input shapes, or contains operations that can’t be represented in the graph, such as control flow statements like if/else.
- Autotuning speeds things up at the expense of compile time. Modal makes it easy to cache these compilation steps and load the optimized model on demand! Check out this blog post to see how we used torch.compile and other tools to run the FLUX.1-dev model 3x faster.
`dynamic: bool = False`
- Optimizes for different input shapes.
- `False`: The compiler will optimize for the input shape it sees when it is first called. If the input shape changes, it will trigger a re-compile. Keep the default if you know your input shape is static.
- `True`: Attempts to optimize for dynamic shapes at the expense of a minor performance hit. Use this if you know your inputs will have varying shapes.
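A small sketch of the `dynamic=True` behavior; `backend="eager"` (covered below) is used here purely to keep the example lightweight:

```python
import torch

def total(x):
    return x.sum()

# dynamic=True asks the compiler for shape-polymorphic kernels instead of
# specializing on the first shape seen. backend="eager" keeps this sketch
# lightweight (no kernel code generation).
flexible = torch.compile(total, dynamic=True, backend="eager")

print(flexible(torch.ones(3)).item())   # 3.0
print(flexible(torch.ones(10)).item())  # 10.0, without a recompile
```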
`fullgraph: bool = False`
- Enforces compiling the entire function as one graph.
- `False`: Splits the graph when it encounters operations that can’t be optimized. These fall back to eager execution.
- `True`: Forces the compiler to attempt to create a contiguous graph. If it encounters operations it can’t optimize, it will raise an error. You can use this to identify and fix graph breaks to maximize speedup.
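A sketch of using `fullgraph=True` to surface a graph break; the data-dependent `if` below is a classic cause, and `backend="eager"` is used only to skip kernel codegen in this illustration:

```python
import torch

def branchy(x):
    # Python control flow on a tensor value is data-dependent and
    # typically causes a graph break.
    if x.sum() > 0:
        return x + 1
    return x - 1

strict = torch.compile(branchy, fullgraph=True, backend="eager")

try:
    strict(torch.ones(4))
except Exception as e:
    # With fullgraph=True the break raises instead of silently
    # falling back to eager execution.
    print(type(e).__name__)
```

With `fullgraph=False` (the default), the same function would run fine, just with the branch executed eagerly.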
`backend: str = "inductor"`
- Specifies which backend to use to compile the FX graph into further optimized kernels.
- `"inductor"`: Default backend that has great coverage of PyTorch operations.
- `"eager"`: For debugging. Runs TorchDynamo without the full compilation to surface any graph breaks.
- `"aot_eager"`: Runs AOT Autograd without fully compiling, which can be used to determine if there are any issues tracing the backward graph.
- Other backends: You can view other registered backends with `torch._dynamo.list_backends()`, or create a custom backend.
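A quick sketch of inspecting backends and using `"eager"` for debugging (the lambda is illustrative):

```python
import torch
import torch._dynamo

# Inspect the registered (non-debug) backends; "inductor" is among them.
print(torch._dynamo.list_backends())

# "eager" runs the traced FX graph without further compilation, which
# isolates TorchDynamo tracing issues from backend codegen issues.
debug = torch.compile(lambda x: x * 2, backend="eager")
print(debug(torch.tensor([1.0, 2.0])))  # tensor([2., 4.])
```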
`options: dict`
- Use this for configuring options specific to the backend.
- This is for custom configurations. Most common options should be set with `mode`.
- For TorchInductor you can list all config options with `torch._inductor.list_options()`.