August 25, 2025 · 5 minute read

How much does it cost to run NVIDIA L4 GPUs in 2025?

In 2023, NVIDIA launched the L4 as the successor to the T4. The L4 pairs 24 GB of GDDR6 memory with the Ada Lovelace architecture, delivering up to 485 TFLOPS of FP8 performance (with sparsity). With a max power consumption of 72W, it is one of the most power-efficient NVIDIA data-center GPUs for dense deployments.

L4 specs & why efficiency matters

The idea behind the NVIDIA L4 is to balance performance, power consumption, and size. Its smaller size and lower power consumption maximizes rack density while minimizing cooling costs. Built on the Ada Lovelace architecture, the L4’s key specifications include:

  • Architecture: Ada Lovelace with 7,424 CUDA cores
  • Memory: 24 GB GDDR6 (50% more than T4’s 16 GB)
  • Bandwidth: 300 GB/s memory bandwidth
  • Performance: 30.3 TFLOPS FP32, 485 TFLOPS FP8 with sparsity
  • Power: 72W max TDP, roughly one-fifth of the L40S's 350W
  • Media engines: 2 NVENC (AV1-capable), 4 NVDEC units

NVIDIA markets this card as ideal for AI, video, and graphics applications. The 24 GB of memory means you can serve larger models and handle bigger batch sizes compared to the T4.

The L4 also incorporates NVENC and NVDEC units, which enable an 8×L4 server to handle 1,040 concurrent 720p30 AV1 streams, delivering 120× higher throughput than CPU-only solutions while improving energy efficiency by up to 99%.
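To put that throughput claim in concrete terms, here's a quick back-of-the-envelope in Python. The stream count comes from NVIDIA's 8×L4 figure above; the ~$0.80/hr rate is an assumption based on the on-demand pricing discussed later in this article.

```python
# Back-of-the-envelope density math from NVIDIA's 8xL4 figure.
TOTAL_STREAMS = 1040   # concurrent 720p30 AV1 streams on an 8x L4 server
CARDS = 8

streams_per_card = TOTAL_STREAMS / CARDS

# At an assumed ~$0.80/hr per L4, cost per concurrent stream-hour:
cost_per_stream_hour = 0.80 / streams_per_card

print(streams_per_card)                 # 130.0
print(round(cost_per_stream_hour, 5))   # 0.00615
```

In other words, each L4 handles about 130 streams, putting the transcoding cost well under a cent per stream-hour.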

NVIDIA L4 cloud pricing breakdown

Here’s where things get interesting for cost-conscious teams. The L4’s pricing varies dramatically depending on how you consume it:

| Provider | Serverless | Spot | On-demand | 1-yr Reserved | 3-yr Reserved | Source |
| --- | --- | --- | --- | --- | --- | --- |
| Modal | $0.80/hr ($0.000222/sec) | n/a | n/a | n/a | n/a | Modal pricing |
| GCP | n/a | $0.2231/hr | $0.71/hr | ~$0.45/hr | ~$0.32/hr | Google pricing |
| AWS (G6) | n/a | ~$0.41/hr | $0.80/hr | ~$0.52/hr | ~$0.37/hr | Vantage |

L4s are already a low-cost option compared to other GPU types, and they can be even more economical when consumed serverlessly for variable workloads. A 30-second inference job costs about two-thirds of a cent on Modal (30 s × $0.000222/s ≈ $0.0067), while a traditional cloud provider like AWS bills a 60-second minimum. The difference seems small at first, but it adds up when you're running many bursty inference workloads.
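The comparison for that 30-second job can be sketched in a few lines of Python. The rates are taken from the table above; the 60-second minimum reflects EC2's per-second billing floor.

```python
MODAL_PER_SEC = 0.000222        # serverless, billed per second
AWS_ON_DEMAND_PER_HR = 0.80     # approximate G6 on-demand rate

def serverless_cost(seconds):
    # Pay only for the seconds the job actually runs.
    return seconds * MODAL_PER_SEC

def on_demand_cost(seconds, minimum_s=60):
    # EC2 bills per second, but with a 60-second minimum.
    return max(seconds, minimum_s) / 3600 * AWS_ON_DEMAND_PER_HR

print(round(serverless_cost(30), 4))   # 0.0067
print(round(on_demand_cost(30), 4))    # 0.0133
```

For this job shape, the minimum-billing floor alone roughly doubles the on-demand cost, before counting any idle time between requests.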

Real-world deployment patterns

Consider a typical AI startup serving an image generation model. During peak hours, they might need 10 L4s to handle traffic, but overnight that drops to just 1-2 GPUs. With traditional cloud instances, they’d need to either provision for peak (wasting money on idle resources) or accept degraded performance during busy periods. Even if they built their own autoscaling system, they would still have to work within the constraints of spending minutes to provision and deprovision instances (not to mention paying for that time).

Serverless automatically scales from 0 to whatever you need, spinning up L4s in seconds when requests come in and releasing them when traffic dies down. This allows you to achieve much higher utilization on the GPUs you’re paying for.
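On Modal, the peak/off-peak scenario above is expressed as autoscaler limits on the function itself. This is a sketch only: the parameter names (`min_containers`, `max_containers`, `scaledown_window`) follow Modal's current autoscaler settings, but these have been renamed across versions, so check the docs for your installed release.

```python
import modal

app = modal.App()

# Scale from 0 idle containers up to 10 L4s at peak, and release
# each container after ~60 seconds without traffic.
@app.function(gpu="L4", min_containers=0, max_containers=10, scaledown_window=60)
def generate_image(prompt: str):
    ...  # load the model and run inference here
```

With `min_containers=0`, you pay nothing overnight; the autoscaler adds containers as requests arrive, up to the `max_containers` cap.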

Traditional cloud purchase models still have their place. If you are running workloads that are very stable, like a 24/7 video transcoding pipeline, a reserved AWS/GCP instance at ~$0.30/hr makes perfect sense cost-wise. The key is matching your consumption model to your workload pattern. Static, predictable workloads can benefit from reservations, while variable workloads (like customer-facing interactive AI features) almost always cost less on serverless platforms.
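A simple breakeven calculation makes the "match the consumption model to the workload" rule concrete. Using the serverless and 3-yr reserved rates from the table above (a reservation is paid 24/7 whether or not the GPU is busy):

```python
SERVERLESS_PER_HR = 0.80   # Modal L4, billed only while running
RESERVED_PER_HR = 0.32     # ~3-yr GCP reservation, paid around the clock

def breakeven_utilization(serverless_hr, reserved_hr):
    # Fraction of the time the GPU must be busy before a 24/7
    # reservation becomes cheaper than per-second serverless.
    return reserved_hr / serverless_hr

print(round(breakeven_utilization(SERVERLESS_PER_HR, RESERVED_PER_HR), 2))  # 0.4
```

At these rates, a reservation only wins once the GPU is busy more than ~40% of the time, which is why bursty, customer-facing workloads usually come out ahead on serverless.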

L4 limitations and workarounds

The L4 isn’t perfect, and understanding its limitations helps avoid costly mistakes. First, there’s no NVLink support. Multi-GPU communication happens over PCIe at 64 GB/s instead of NVLink’s 900 GB/s. This makes the L4 unsuitable for model parallel training or serving models that require tight GPU coupling.
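The bandwidth gap is easy to quantify. As an illustration, assume you need to move the fp16 weights of a 7B-parameter model (~14 GB) between GPUs; the interconnect figures are the ones cited above.

```python
def transfer_seconds(size_gb, bandwidth_gb_s):
    # Idealized transfer time, ignoring protocol overhead.
    return size_gb / bandwidth_gb_s

WEIGHTS_GB = 14      # e.g. a 7B-parameter model in fp16
PCIE_GBPS = 64       # L4: PCIe interconnect
NVLINK_GBPS = 900    # NVLink-class interconnect, for comparison

print(round(transfer_seconds(WEIGHTS_GB, PCIE_GBPS) * 1000))      # ~219 ms
print(round(transfer_seconds(WEIGHTS_GB, NVLINK_GBPS) * 1000, 1)) # ~15.6 ms
```

A ~14× slowdown per transfer is tolerable for occasional weight loads, but not for the per-step communication that model-parallel serving or training requires.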

The 24 GB memory limit also constrains model selection. Even at 4-bit quantization, Llama-2-70B's weights alone are roughly 35 GB, more than a single L4 can hold; in practice the ceiling is around 30B parameters at 4-bit before accounting for KV cache, and aggressive quantization can degrade quality unacceptably. AI companies serving multiple models will use L4s for smaller models but stick with larger GPUs like the L40S or the H100 for production deployments of more powerful models.
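A weights-only rule of thumb (parameters × bits ÷ 8) shows what fits in 24 GB. The helper below is illustrative; real deployments also need headroom for KV cache and activations.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    # Weights only; KV cache and activations need additional headroom.
    return params_billion * bits_per_weight / 8

print(weight_memory_gb(70, 4))   # 35.0 GB of weights, over the L4's 24 GB
print(weight_memory_gb(13, 4))   # 6.5 GB, fits with room for KV cache
print(weight_memory_gb(7, 16))   # 14.0 GB, an fp16 7B model also fits
```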

Another consideration: the L4 doesn’t support MIG (Multi-Instance GPU) partitioning. On an A100, you can slice the GPU into up to 7 isolated instances, perfect for multi-tenant scenarios. With the L4, you’ll need container-level isolation, which is possible but requires more careful resource management.

Quick-start: from zero to inference

Getting started with L4s has never been easier, especially with serverless platforms. Here’s a complete example using Modal that you can run in seconds:

import modal

app = modal.App()

@app.function(gpu="L4")
def run_inference():
    # This will run on an L4 on Modal
    ...

@app.local_entrypoint()
def main():
    run_inference.remote()
This code spins up an L4 container on-demand, runs your model on an input, and then spins down the container. Total cost for a single 30-second inference call? Less than a penny. Compare that to requesting quota on EC2, waiting several minutes to spin up the instance, configuring CUDA drivers, managing container images, and orchestrating clusters once you go to production. Going with a serverless approach eliminates operational overhead while minimizing costs.

L4 is ideal for low-cost, low-power AI and video

The NVIDIA L4 is a cost-effective option for serving small models in 2025. While you won’t be running the latest LLMs on L4s, they are powerful enough for smaller generative AI workloads, efficient enough for dense deployments, and affordable enough for experimentation. At $0.30-0.80/hr depending on your consumption model, they are an order of magnitude cheaper than top-of-the-line GPUs like H100s.

For teams that are running variable workloads, serverless platforms like Modal at $0.000222/second eliminate the traditional barriers of high minimum costs and complex infrastructure management. You can prototype for dollars and scale seamlessly as your application grows.
