At Modal, we believe that AI inference has unique infrastructure needs.
So we’re thrilled to share that a new inference-focused accelerator is now available for all Modal users: NVIDIA L40S GPUs, priced at $1.95/hr.
The L40S can offer substantial performance benefits over our current most popular inference-focused accelerator, the NVIDIA A10 GPU.
## Run bigger models with the L40S
At 48 GB, the L40S has twice the on-device GDDR6 random access memory of the A10.
That means you can run larger models on large inputs. For example, running Flux.1-schnell in 16-bit precision consumes 24 GB of RAM just for model weights (a 12-billion-parameter transformer at two bytes per parameter), so it cannot run on a single A10 GPU without a throughput-killing offload to CPU RAM. Trying to do so triggers a dreaded CUDA out-of-memory error, as in the sketch below.
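As a rough illustration, here is a minimal sketch of the failing load, assuming the `diffusers` library and the `black-forest-labs/FLUX.1-schnell` weights from the Hugging Face Hub (the model ID and dtype are our choices for illustration, not a fixed recipe):

```python
import torch
from diffusers import FluxPipeline

# ~12B parameters x 2 bytes (bfloat16) ~= 24 GB of weights
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
# Moving the weights onto a 24 GB A10 raises torch.cuda.OutOfMemoryError
pipe.to("cuda")
```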
But the same workload fits very comfortably in a single L40S!
## Run inference faster with the L40S
The L40S is also faster than the A10, not just beefier.
Users can expect approximately a 40% speedup for memory-bound jobs like small-batch inference and well over a 100% speedup for compute-bound jobs using the 16-bit Tensor Cores. Those expectations follow from the spec ratios in the table below: 864/600 ≈ 1.4× the memory bandwidth and 362/125 ≈ 2.9× the 16-bit Tensor Core arithmetic bandwidth. Without any tuning, we were able to achieve a 20% speedup in a basic load test for a chat-style workload (Llama 3.1 8B with vLLM).
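Our exact benchmark harness isn't shown here, but a load test of this kind can be sketched with vLLM's offline batch API; the model ID, prompts, and sampling parameters below are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

# A chat-style workload: a batch of short prompts, moderate-length completions.
prompts = ["Explain GPU memory bandwidth in one paragraph."] * 64
sampling_params = SamplingParams(temperature=0.8, max_tokens=256)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(prompts, sampling_params)

# Run the same script on an A10 and an L40S and compare tokens per second.
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generated {total_tokens} tokens")
```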
## A10 vs. L40S specs
See the table below for a comparison of the features of the A10 and the L40S, adapted from the manufacturer datasheets. If any of the vocabulary is new to you, see our new GPU Glossary for an explanation.
| | A10 | L40S |
|---|---|---|
| Streaming Multiprocessor Architecture | Ampere | Ada Lovelace |
| Compute Capability | 8.6 | 8.9 |
| GPU RAM | 24 GB GDDR6 | 48 GB GDDR6 |
| GPU RAM ↔ Streaming Multiprocessor Memory Bandwidth | 600 GB/s | 864 GB/s |
| FP16/BF16 Tensor Core Arithmetic Bandwidth | 125 TFLOP/s | 362 TFLOP/s |
| FP8 Tensor Core Arithmetic Bandwidth | N/A | 733 TFLOP/s |
## Get started now
Modal is the easiest way to deploy code to GPUs. Our custom infrastructure allows us to spin up L40S (or other GPU) containers running your code in one second. We help you efficiently autoscale your workloads to hundreds of GPUs, and you only ever pay for what you use.
Modal also comes with $30/month in free compute, so you can try an L40S for free right now. Just sign up for Modal if you haven't yet, install and authenticate with our Python SDK, and then decorate a Python function with `@app.function(gpu="L40S")`:
```python
import modal

app = modal.App()

@app.function(gpu="L40S")
def run_flux_inference():
    # This function body runs on a Modal L40S
    ...
```
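To actually invoke the function, one minimal option is to add a local entrypoint and launch it with `modal run` (the file name `app.py` and the function body are illustrative):

```python
@app.local_entrypoint()
def main():
    # Executes run_flux_inference remotely on an L40S when you type `modal run app.py`
    run_flux_inference.remote()
```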
If you have questions on our L40S support or want to share something you’ve built, please reach out in our community Slack.