How much does it cost to run NVIDIA B200 GPUs in 2025?
Researchers and engineers can now get their hands on the first NVIDIA Blackwell GPU, the B200. With 192 GB of ultra-fast HBM3e and a second-generation Transformer Engine that introduces FP4 arithmetic, a single card delivers up to 20 petaFLOPS of sparse-FP4 AI compute.
Blackwell B200 specs & performance
The B200 represents a massive leap in GPU capabilities: built on TSMC’s 4NP process, it packs 208 billion transistors across a dual-die design that software sees as a single GPU, making room for the new FP4 Tensor Cores and fifth-generation NVLink connectivity.
- Process: TSMC 4NP with 208 billion transistors (dual-die)
- Memory: 192 GB HBM3e (cloud consoles expose 180 GB usable) — 2.4x H100 capacity
- Bandwidth: 8 TB/s memory bandwidth, doubling Hopper’s throughput
- Peak compute: 20 PFLOPS FP4 with 2:1 sparsity — ~5x H100 inference throughput
- Interconnect: NVLink 5 at 1.8 TB/s bidirectional, removing PCIe bottlenecks
This combination of memory capacity and FP4 compute lets you fit models on a single card that would require complex parallelism across multiple H100s, while delivering inference throughput that makes real-time AI applications viable.
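Some rough weight-memory arithmetic shows why that capacity matters. Below is a minimal sketch that ignores KV cache, activations, and runtime overhead; the model sizes are illustrative:

```python
# Weight memory = parameters x bytes per parameter. Ignores KV cache,
# activations, and runtime overhead; model sizes are illustrative.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    # 1e9 params x bytes/param, expressed directly in GB
    return params_billions * BYTES_PER_PARAM[precision]

for params in (70, 180, 340):
    sizes = ", ".join(f"{p}: {weight_memory_gb(params, p):.0f} GB" for p in BYTES_PER_PARAM)
    print(f"{params}B parameters -> {sizes}")

# A 340B-parameter model quantized to FP4 (~170 GB) fits in a single B200's
# 180 GB of usable HBM3e; in FP16 the same weights would need several H100s.
```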
NVIDIA B200 cloud pricing
Here’s per-GPU pricing for B200s across major providers, from most to least flexible purchase options (July 2025):
Provider & SKU | Serverless | Spot | On‑demand | Capacity block | 1‑yr reservation | 3‑yr reservation | Pricing sources |
---|---|---|---|---|---|---|---|
Modal | $6.25/hr | n/a | n/a | n/a | n/a | n/a | Modal pricing |
Baseten | $9.98/hr | n/a | n/a | n/a | n/a | n/a | Baseten pricing |
RunPod | n/a | n/a | $5.99/hr | n/a | ~$5.09/hr | n/a | RunPod pricing |
Lambda Labs | n/a | n/a | $3.79/hr | n/a | $3.49/hr | $2.99/hr | Lambda Labs pricing |
AWS | n/a | n/a | $14.24/hr | $8.14/hr | ~$12.50/hr | n/a | Vantage, AWS Savings Plans, AWS Capacity Block pricing |
GCP | n/a | $8.06/hr | $18.53/hr | n/a | $11.12/hr | $7.09/hr | Vertex pricing, Google Cloud pricing, Spot pricing |
Note that on AWS and GCP, B200s are only available in 8-GPU instances.
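To make that minimum commitment concrete, the sketch below converts the per-GPU rates from the table into full-node hourly and monthly figures (the 730 hours/month figure is a rounded average):

```python
# AWS and GCP sell B200s in 8-GPU nodes, so the smallest bill is 8x the
# per-GPU rate. Rates are from the table above.
PER_GPU_HOURLY = {"AWS on-demand": 14.24, "GCP on-demand": 18.53, "GCP spot": 8.06}
GPUS_PER_NODE = 8
HOURS_PER_MONTH = 730  # rounded average

for name, rate in PER_GPU_HOURLY.items():
    node_hourly = rate * GPUS_PER_NODE
    print(f"{name}: ${node_hourly:,.2f}/hr per node, "
          f"~${node_hourly * HOURS_PER_MONTH:,.0f}/month if left running")
```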
Choosing the right provider
Different scenarios call for different providers:
Scenario | Best fit | Rationale |
---|---|---|
Bursty AI inference traffic | Modal | Per-second billing and sub-second cold starts keep effective cost lowest |
Static, predictable AI inference traffic | AWS, GCP | Most reliable providers, with reservation-based discounts for steady load |
Multi-week training runs | Lambda Labs | Cheapest reservation prices |
Serverless options like Modal automatically scale up and down from zero, so you only pay for compute you actually use. Reserved-capacity options give you a fixed block of resources; that is fine for static workloads, but for variable workloads it means higher latency when demand spikes and wasted money when demand is low.
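To see where each model wins, here is a rough comparison using Modal's serverless rate and a 1-year reservation rate from the tables above; the traffic pattern is an assumption:

```python
# Serverless per-second billing vs. an always-on reservation for bursty
# traffic. Rates come from the tables above; the traffic pattern is made up.
SERVERLESS_PER_SECOND = 6.25 / 3600   # Modal B200, $/GPU-second
RESERVED_PER_HOUR = 3.49              # e.g. Lambda Labs 1-yr reservation

busy_hours_per_day = 5                # assumed time the GPU is actually working
serverless_monthly = busy_hours_per_day * 3600 * 30 * SERVERLESS_PER_SECOND
reserved_monthly = RESERVED_PER_HOUR * 24 * 30

print(f"Serverless, billed only while busy: ${serverless_monthly:,.0f}/month")
print(f"Reserved, billed around the clock: ${reserved_monthly:,.0f}/month")
# At these rates, the reservation only wins once the GPU is busy ~13+ hours a day.
```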
On‑premise options: buy a B200 or DGX B200?
For those considering ownership:
- Standalone B200 SXM module: $30,000 - $40,000 (one 700W GPU board)
- Grace-Blackwell GB200 Superchip: $60,000 - $70,000 (1x Grace CPU + 2x B200)
- NVIDIA DGX B200: ~$515,000 (8x B200, 1.44 TB GPU RAM, 72 PFLOPS FP8)
At $30k per card, breakeven against $6-8/hour cloud rates happens at ~60% utilization over 18 months (excluding electricity and cooling). Factor in datacenter space (~14 kW per DGX B200) and staff before buying.
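That breakeven is easy to sanity-check. Here is a minimal sketch, where the per-GPU host share (chassis, baseboard, networking) is an assumption chosen for illustration and power and cooling are excluded, as above:

```python
# Buy-vs-rent breakeven for one B200. The host-share figure is an assumption;
# electricity and cooling are excluded, matching the estimate above.
CARD_PRICE = 30_000          # standalone B200 SXM module, low end
HOST_SHARE = 20_000          # assumed per-GPU share of the server it lives in
CLOUD_RATE = 6.25            # $/GPU-hour you would otherwise pay
HORIZON_HOURS = 18 * 730     # ~18 months

breakeven_utilization = (CARD_PRICE + HOST_SHARE) / (CLOUD_RATE * HORIZON_HOURS)
print(f"Breakeven utilization over 18 months: {breakeven_utilization:.0%}")
# ~61% with these inputs, in line with the ~60% figure above
```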
B200 vs. H100/H200: is the upgrade worth it?
The B200 offers compelling advantages for specific workloads:
- Memory headroom - 192 GB HBM3e lets you serve GPT-4-class 400B-parameter models on one card instead of sharding them across two
- FP4 Transformer Engine - 5x higher inference throughput; MLPerf Llama-2-70B results show 2-3x tokens/second on identical node counts
- Fifth-gen NVLink - 1.8 TB/s cuts all-reduce time ~40% in 8-GPU data-parallel training (see the sketch after this list)
- Better real-world latency - Modal’s benchmarks show 2.5x lower TTFB versus H200 for MoE models
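In an idealized, bandwidth-bound ring all-reduce, each GPU moves 2(N-1)/N of the gradient volume, so communication time scales inversely with link bandwidth; fixed launch and synchronization overheads are why measured gains land nearer the ~40% above than the theoretical 2x. A minimal sketch, where the gradient size and link efficiency are assumptions:

```python
# Idealized ring all-reduce: each GPU sends and receives 2*(N-1)/N of the
# gradient volume. Gradient size and link efficiency are assumptions.
def allreduce_seconds(grad_bytes: float, n_gpus: int, link_gb_per_s: float,
                      efficiency: float = 0.8) -> float:
    traffic_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return traffic_bytes / (link_gb_per_s * 1e9 * efficiency)

GRAD_BYTES = 70e9 * 2  # e.g. a 70B-parameter model with BF16 gradients

for name, bw in [("Hopper NVLink 4", 900), ("Blackwell NVLink 5", 1800)]:
    ms = allreduce_seconds(GRAD_BYTES, n_gpus=8, link_gb_per_s=bw) * 1000
    print(f"{name} ({bw} GB/s): ~{ms:.0f} ms per all-reduce")
```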
Quick‑start guide: run code on a cloud B200 in under 5 minutes
Modal’s serverless platform lets you run and deploy code on a B200 without having to manage cloud resources. To get started, simply sign up and run the code snippet below:
import modal

app = modal.App()

@app.function(gpu="B200")
def run_big_model():
    # This will run on a B200 on Modal
    import subprocess
    subprocess.run(["nvidia-smi"], check=True)  # confirm which GPU the container landed on
At $0.001736 per GPU-second, you can benchmark for pennies, then fan out to thousands of ephemeral workers without provisioning a single instance.
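Here is a minimal sketch of that fan-out, building on the snippet above (the per-prompt function and prompt list are illustrative stand-ins for your own workload); launch it with `modal run your_file.py`:

```python
@app.function(gpu="B200")
def run_on_prompt(prompt: str) -> str:
    # Hypothetical per-prompt workload; swap in your real inference code.
    return prompt.upper()

@app.local_entrypoint()
def main():
    prompts = [f"benchmark prompt {i}" for i in range(1000)]
    # .map() fans the calls out across parallel B200 containers and
    # returns results as the workers finish.
    for result in run_on_prompt.map(prompts):
        print(result)
```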
Get started with B200s today
Modal’s serverless B200s at $6.25/hour are the most cost-effective option for bursty workloads.
If your H100s are out of memory or your user-visible latency targets are slipping, Blackwell’s 192 GB HBM3e and FP4 Tensor Cores are the most cost-effective escape hatch available today.