December 19, 2024 · 4 minute read
Introducing: L40S GPUs on Modal

At Modal, we believe that AI inference has unique infrastructure needs.

So we’re thrilled to share that a new inference-focused accelerator is now available for all Modal users: NVIDIA L40S GPUs, priced at $1.95/hr.

The L40S can offer substantial performance benefits over our current most popular inference-focused accelerator, the NVIDIA A10 GPU.

Run bigger models with the L40S

At 48 GB, the L40S has twice the on-device GDDR6 RAM of the A10.

That means you can run larger models on larger inputs. For example, Flux.1-schnell has roughly 12 billion parameters, so at two bytes per parameter, running it in 16-bit precision consumes about 24 GB of RAM for the model weights alone. It therefore cannot run on a single A10 GPU without a throughput-killing offload to CPU RAM, and trying anyway will trigger a dreaded CUDA out-of-memory (OOM) error:

[Screenshot: CUDA OOM error when running Flux.1-schnell on an A10]

But the same workload fits very comfortably in a single L40S!

[Screenshot: Flux.1-schnell inference succeeding on a single L40S]
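To make that concrete, here is a minimal sketch of what such a workload might look like on Modal, assuming the Hugging Face diffusers FluxPipeline API; the container image definition, package list, and prompt are illustrative, not from the original post:

import modal

# Hypothetical image definition: any image with torch and diffusers works
image = modal.Image.debian_slim().pip_install(
    "torch", "diffusers", "transformers", "accelerate", "sentencepiece"
)
app = modal.App(image=image)

@app.function(gpu="L40S", timeout=600)
def generate(prompt: str = "a photo of a dachshund on the moon") -> None:
    import torch
    from diffusers import FluxPipeline

    # ~24 GB of 16-bit weights fits comfortably in the L40S's 48 GB of GDDR6
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
    ).to("cuda")

    # schnell is distilled for few-step, guidance-free sampling
    image = pipe(prompt, num_inference_steps=4, guidance_scale=0.0).images[0]
    image.save("/tmp/flux_out.png")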

Run inference faster with the L40S

The L40S is also faster than the A10, not just beefier.

Users can expect roughly a 40% speedup for memory-bound jobs like small-batch inference, and well over a 100% speedup for compute-bound jobs using 16-bit Tensor Cores; the back-of-the-envelope calculation after the chart below shows where those figures come from. Without any tuning, we were able to achieve a 20% speedup in a basic load test of a chat-style workload (LLaMA 3.1 8B with vLLM).

[Chart: L40S vs. A10 throughput in a LLaMA 3.1 8B vLLM load test]
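The headline speedups fall out of simple ratios of the datasheet figures in the spec table below. This is a crude scaling model, not a benchmark:

# Datasheet figures, taken from the spec table below
A10 = {"mem_bw_gb_s": 600, "fp16_tflop_s": 125}
L40S = {"mem_bw_gb_s": 864, "fp16_tflop_s": 362}

# Memory-bound work (e.g. small-batch decoding) scales with memory bandwidth
print(f"memory-bound: {L40S['mem_bw_gb_s'] / A10['mem_bw_gb_s']:.2f}x")   # ~1.44x, i.e. ~40% faster

# Compute-bound work (e.g. large-batch prefill) scales with Tensor Core throughput
print(f"compute-bound: {L40S['fp16_tflop_s'] / A10['fp16_tflop_s']:.2f}x")  # ~2.90x, i.e. ~190% faster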

A10 vs L40S specs

See the table below for a comparison of the features of the A10 and the L40S, adapted from the manufacturer datasheets. If any of the vocabulary is new to you, click the link to be taken to our new GPU Glossary for an explanation.

|  | A10 | L40S |
| --- | --- | --- |
| Streaming Multiprocessor Architecture | Ampere | Ada Lovelace |
| Compute Capability | 8.6 | 8.9 |
| GPU RAM | 24 GB GDDR6 | 48 GB GDDR6 |
| Memory Bandwidth | 600 GB/s | 864 GB/s |
| FP16/BF16 Tensor Core Arithmetic Bandwidth | 125 TFLOP/s | 362 TFLOP/s |
| FP8 Tensor Core Arithmetic Bandwidth | N/A | 733 TFLOP/s |
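If you want to sanity-check a few of these numbers from inside a container, here's a quick sketch, assuming torch is installed in your Modal image:

import torch

# Run this inside a Modal function with gpu="L40S" to inspect the device
props = torch.cuda.get_device_properties(0)
print(props.name)                             # e.g. NVIDIA L40S
print(f"{props.major}.{props.minor}")         # compute capability: 8.9
print(f"{props.total_memory / 1e9:.0f} GB")   # ~48 GB of GDDR6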

Get started now

Modal is the easiest way to deploy code to GPUs. Our custom infrastructure allows us to spin up L40S (or other GPU) containers running your code in one second. We help you efficiently autoscale your workloads to hundreds of GPUs, and you only ever pay for what you use.

Modal also comes with $30/month in free compute, so you can try an L40S for free right now. Just sign up for Modal if you haven’t yet, install our Python SDK with pip install modal, authenticate with modal setup, and then decorate a Python function with @app.function(gpu="L40S"):

import modal

app = modal.App()

@app.function(gpu="L40S")
def run_flux_inference():
    # This will run on a Modal L40S
    import subprocess
    subprocess.run(["nvidia-smi"], check=True)  # e.g., confirm the GPU is attached

@app.local_entrypoint()
def main():
    run_flux_inference.remote()
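Then kick it off from your terminal with the Modal CLI (the filename here is just whatever you saved the snippet as):

modal run l40s_hello.py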

If you have questions on our L40S support or want to share something you’ve built, please reach out in our community Slack.
