August 12, 2025 · 5 minute read

How much does it cost to run NVIDIA L40S GPUs in 2025?

The NVIDIA L40S is an ideal GPU for cost-effective AI inference and graphics workloads. With 48 GB of GDDR6 memory and Ada Lovelace architecture, it is comparable in price to A100s (40GB version) and can exceed A100 performance for compute-bound workloads.

L40S specs & performance

The NVIDIA L40S strikes a balance between performance and affordability, making advanced AI accessible without breaking the budget.

  • Architecture: Ada Lovelace with 18,176 CUDA cores
  • Memory: 48 GB GDDR6 with ECC, more than A100 40GB
  • Bandwidth: 864 GB/s memory bandwidth
  • Peak compute: 1.47 PFLOPS FP8 tensor performance with sparsity
  • Special features: 4th-gen Tensor Cores with FP8, 3rd-gen RT Cores for graphics

This configuration makes it ideal for serving 13B-70B parameter models, running diffusion models, and GPU-accelerated visualization tasks.
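
If you rent an L40S and want to sanity-check that you actually landed on one, a quick PyTorch snippet does the trick (this assumes PyTorch is installed; the printed values are approximate):

import torch

props = torch.cuda.get_device_properties(0)
print(props.name)                                 # e.g. "NVIDIA L40S"
print(f"{props.total_memory / 1024**3:.0f} GiB")  # ~45 GiB visible of the 48 GB (GiB vs. GB plus ECC overhead)
print(f"{props.multi_processor_count} SMs")       # 142 SMs, i.e. 18,176 CUDA cores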

NVIDIA L40S cloud pricing

Here’s per-GPU pricing for L40S across major providers (August 2025):

| Provider & SKU | Serverless | Spot | On-demand | Capacity block* | 1-yr reservation | 3-yr reservation | Pricing sources |
|---|---|---|---|---|---|---|---|
| Modal | $1.95/hr or $0.000542/sec | n/a | n/a | n/a | n/a | n/a | Modal pricing |
| RunPod | $1.90/hr | n/a | $0.86/hr | n/a | n/a | n/a | RunPod |
| AWS (G6e, 1×L40S) | n/a | $1.1027/hr | $1.861/hr | n/a | ~$1.17/hr (No-upfront) / ~$1.09/hr (All-upfront) | ~$0.80/hr (No-upfront) / ~$0.70/hr (All-upfront) | AWS Pricing (us-east-1) |
| CoreWeave | n/a | n/a | $2.25/hr | n/a | $0.90+/hr | $0.90+/hr | CoreWeave |
| Civo | n/a | n/a | $1.29/hr | n/a | $0.89+/hr | $0.89+/hr | Civo |
| Vultr | n/a | n/a | $1.67/hr | n/a | n/a | $0.848+/hr | Vultr |
| Oracle Cloud (OCI) | n/a | n/a | $3.50/hr | n/a | n/a | n/a | Oracle |
| Replicate (multi-GPU L40S) | n/a | n/a | $3.51/hr | n/a | n/a | n/a | Replicate |

* Capacity blocks are AWS-specific

GCP does not offer L40S GPUs at this time.

Choosing the right provider

Different workload patterns call for different approaches:

| Scenario | Best fit | Rationale |
|---|---|---|
| Bursty inference workloads | Modal | Per-second billing ($0.000542/s ≈ $1.95/hr) eliminates idle cost |
| Budget-conscious consumer experiments | RunPod | Lowest on-demand L40S (~$0.86/hr); optional community capacity often ~$0.69–$0.79/hr |
| Static, predictable AI inference traffic | AWS | 1–3 year commitments drop g6e (1×L40S) to ~$0.70–$0.80/hr |

On-premise options: buy an L40S?

For those considering ownership:

  • Single L40S PCIe card: ~$7,500
  • Dell PowerEdge R760xa with 4x L40S: ~$47,000-$49,000

At $7.5k per card, breakeven against $1-2/hour cloud rates happens in under a year of heavy utilization. Factor in ~$0.20-0.30/hr for electricity and cooling, and your effective cost might be $0.80-0.90/hr, competitive with cloud rates if you maintain >50% utilization.
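
Here's a rough sketch of that math; the purchase price, power cost, amortization window, and utilization below are assumptions you should swap for your own:

# Back-of-envelope owned-vs-cloud math; every input here is an assumption to adjust
CARD_PRICE = 7_500          # USD for one L40S PCIe card
POWER_AND_COOLING = 0.25    # USD per hour of operation
AMORTIZATION_YEARS = 3
UTILIZATION = 0.50          # fraction of hours doing useful work

useful_hours = AMORTIZATION_YEARS * 365 * 24 * UTILIZATION
effective_rate = CARD_PRICE / useful_hours + POWER_AND_COOLING
print(f"Effective cost: ${effective_rate:.2f} per useful hour")  # ~$0.82/hr

CLOUD_RATE = 1.50           # mid-range on-demand cloud price
breakeven_hours = CARD_PRICE / CLOUD_RATE
print(f"Break-even vs. cloud: {breakeven_hours:,.0f} busy hours "
      f"(~{breakeven_hours / 24 / 30:.0f} months at 24/7)")      # ~5,000 hours, ~7 months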

L40S vs. A100 vs. H100 vs. B200: which GPU for your workload?

The L40S fills a unique niche in NVIDIA’s lineup:

  1. L40S (48 GB GDDR6) - Best for inference serving, small fine-tuning jobs (e.g. training an LLM LoRA), and graphics at $1-2/hr
  2. A100 (40/80 GB HBM2e) – Previous-gen Ampere architecture; 80 GB suits larger models; widely available at ~$1–3/hr depending on provider
  3. H100 (80 GB HBM3) - For cutting-edge training and serving larger models at ~$5/hr
  4. B200 (192 GB HBM3e) - For the most compute-intensive AI training and inference workloads (e.g. running 300B+ parameter models) at ~$6-10/hr

The L40S has a great price-to-performance tradeoff for inference workloads, especially for “smaller” gen AI models like generative image models or sub-70B param LLMs.
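
A handy back-of-envelope check for "will this model fit in 48 GB?" is parameter count × bytes per parameter, plus headroom for KV cache and activations. The 20% overhead factor below is a rough assumption, not a measured number:

def fits_on_l40s(params_billion, bytes_per_param=2, overhead=1.2, vram_gb=48):
    # Rough estimate: weight memory * precision * ~20% headroom vs. 48 GB of VRAM
    needed_gb = params_billion * bytes_per_param * overhead
    return needed_gb, needed_gb <= vram_gb

for size_b in (7, 13, 34, 70):
    needed, ok = fits_on_l40s(size_b)  # 2 bytes/param = FP16/BF16 weights
    print(f"{size_b}B params: ~{needed:.0f} GB -> "
          f"{'fits' if ok else 'needs quantization or more GPUs'}")

By this estimate, models up to ~13B fit comfortably at FP16, while a 70B model only fits once quantized (at 4-bit weights, ~0.5 bytes/param, it needs roughly 42 GB).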

Many organizations train on H100s, then deploy inference on L40S fleets for cost efficiency. Of course, if inference speed is paramount for your use case and you are not cost-sensitive, you may still want to evaluate more powerful GPUs for model serving.

Does the L40S support NVLink or MIG?

Short answer: No. The L40S is PCIe-only and doesn’t support NVLink or MIG (Multi-Instance GPU).

This means:

  • Inter-GPU communication happens over PCIe 4.0 x16 (~32 GB/s) instead of NVLink’s multi-hundred GB/s
  • Data parallel training works fine, but tensor/model parallel approaches hit PCIe bottlenecks quickly
  • No hardware partitioning means you can’t slice the GPU into guaranteed instances like with A100/H100s

As a result, L40S GPUs are suboptimal for multi-GPU training workloads. Consider A100s, H100s, and B200s instead if you need NVLink for large model parallelism. If you still need a multi-GPU L40S setup, prefer FSDP/ZeRO sharding strategies and ensure fast networking (100 GbE+) between nodes.
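
For example, here's a minimal FSDP sketch for a multi-L40S node (assumes PyTorch 2.x launched with torchrun; the tiny Sequential model is just a stand-in for a real one):

# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy stand-in for a real model; FSDP shards its params, grads, and optimizer
# state across GPUs so each card holds only a slice, trading PCIe traffic for memory
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)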

When is serverless cheaper than reserved instances on hyperscalers?

The break-even point for serverless vs. reserved L40S instances depends on utilization:

The math: With Modal at $1.95/hr and AWS 3-year reserved at ~$0.80/hr, break-even utilization is:

  • $0.80 / $1.95 ≈ 41%

Simple rule: If your GPU sits idle >60% of the time, serverless is cheaper. If it’s busy most of the day, locking in a reservation will be cheaper.

Monthly cost examples (single GPU):

  • 25% utilized: Serverless ~$351 vs. Reserved ~$540 → serverless wins
  • 60% utilized: Serverless ~$842 vs. Reserved ~$540 → reserved wins
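
To rerun this comparison with your own numbers, here's a small sketch; it assumes ~720 billable hours per month and a reserved rate of ~$0.75/hr, roughly between the AWS 3-year no-upfront and all-upfront rates above:

HOURS_PER_MONTH = 720  # ~30-day month

def monthly_cost(utilization, serverless_rate=1.95, reserved_rate=0.75):
    # Serverless bills only busy hours; a reservation bills every hour, busy or idle
    serverless = utilization * HOURS_PER_MONTH * serverless_rate
    reserved = HOURS_PER_MONTH * reserved_rate
    return serverless, reserved

for util in (0.25, 0.60):
    s, r = monthly_cost(util)
    winner = "serverless" if s < r else "reserved"
    print(f"{util:.0%} utilized: serverless ${s:.0f} vs. reserved ${r:.0f} -> {winner}")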

Keep in mind, however, the inflexibility and capital commitment required for GPU reservations. Traditional cloud platforms also require you to configure and manage your own cloud infrastructure, so make sure to factor in devops cost and slower time-to-ship as well.

Additional considerations:

  • Serverless providers can have your code running on L40S GPUs in less than a second, while with traditional cloud there’s a long process to request quota and provision instances.
  • Serverless auto-scales for traffic spikes; traditional cloud needs pre-provisioned headroom, which makes achieving high utilization difficult.

Quick-start guide: run code on a cloud L40S in under 5 minutes

Modal’s serverless platform lets you run code on an L40S without managing any AI cloud infrastructure:

import modal

app = modal.App()

@app.function(gpu="L40S")
def run_inference():
    # This body runs on an L40S GPU in Modal's cloud; put your model-serving code here
    print("Hello from an L40S!")

@app.local_entrypoint()
def main():
    # Invoke the GPU function from your machine with `modal run <this file>`
    run_inference.remote()

At $0.000542/second, you can prototype for pennies, then scale to production without touching instance configuration.

Get started with L40S today

The L40S represents the sweet spot for AI inference, offering enterprise-grade performance at accessible prices. If you’re serving smaller LLMs, running image generation models, or building GPU-accelerated applications, the L40S delivers the best price-to-performance ratio in today’s market.

Ship your first app in minutes.
