This article will guide you through the key differences between NVIDIA’s A10, A100, and H100 GPUs, helping you make an informed decision based on your specific needs and budget.
GPU comparison
Let’s start with a comparison of the GPUs available on Modal:
GPU Type | VRAM (GiB) | Memory bandwidth (VRAM-to-SRAM, TB/s) | Price (on Modal, $ / hour) | Architecture |
---|---|---|---|---|
H100 | 80 | 3.35 | 4.56 | Hopper |
A100 (80GB) | 80 | 2 | 3.40 | Ampere |
A100 (40GB) | 40 | 1.6 | 2.78 | Ampere |
A10 | 24 | 0.6 | 1.10 | Ampere |
L4 | 24 | 0.3 | 0.80 | Lovelace |
T4 | 16 | 0.3 | 0.59 | Turing |
- VRAM is high-speed, byte-addressable memory located on your graphics card. It plays the same role in the GPU’s memory system as RAM does in your CPU’s. The more VRAM you have, the larger the models you can run.
- In the table above, we show the VRAM-to-SRAM memory bandwidth, which is the rate at which data can be transferred between the GPU’s main memory (VRAM, typically GDDR or HBM) and its on-chip cache memory (SRAM). This bandwidth is crucial for the GPU’s ability to quickly bring model parameters into the compute cores where activations and outputs are calculated.
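To make the bandwidth column concrete, here is a rough back-of-envelope calculation (an illustrative sketch, not a benchmark): in single-stream LLM decoding, every generated token has to stream the model’s weights from VRAM at least once, so memory bandwidth divided by model size gives an upper bound on tokens per second.

```python
# Back-of-envelope ceiling for single-stream LLM decoding: each generated token
# must read all model weights from VRAM, so tokens/s <= bandwidth / model size.
# This ignores KV-cache traffic, compute time, and kernel overheads, so real
# throughput will be lower; treat it as an upper bound only.
PARAMS = 7e9          # a 7B-parameter model
BYTES_PER_PARAM = 2   # fp16 / bf16 weights

model_bytes = PARAMS * BYTES_PER_PARAM  # ~14 GB of weights

for gpu, bandwidth_tb_s in {"H100": 3.35, "A100 (80GB)": 2.0, "A10": 0.6}.items():
    ceiling = bandwidth_tb_s * 1e12 / model_bytes
    print(f"{gpu}: at most ~{ceiling:.0f} tokens/s per request")
```

This is why, for small-batch inference, the bandwidth column often matters more than raw compute throughput.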
H100
- Best for: Training and inference for very large models (70B parameters or more), transformer-based architectures, low (8-bit) precision
- Key features:
- Most powerful NVIDIA datacenter GPU that’s generally available at the time of writing (late 2024)
- ~2x faster than A100 for most workloads, but also harder to get (might have to queue), and more expensive
- Optimized for large language model workloads. It offers over 3 TB/s of memory bandwidth, which is crucial for LLM inference workloads that require rapid data transfer between VRAM and compute cores.
- Contains specialized compute units for lower precision (FP8) operations
A100
- Best for: Training and inference for large models (7B-70B parameters)
- Key features:
- NVIDIA’s workhorse GPU, meant for AI, data analytics, and HPC workloads
- Available in 40GB and 80GB variants
- Because memory bandwidth has scaled more slowly than arithmetic bandwidth, A100s can be more cost-effective than H100s for workloads that are memory-bound, like running large models on small batches
A10
- Best for: Inference for small to medium models (7B parameters or less, like most diffusion-based image generation models), cost-effective, small-scale training for smaller models
- Key features:
- Same architecture as A100, so most code that runs on A100 will run on A10
- Good performance-to-cost ratio for smaller workloads
L4
- Best for: Inference for small to medium size models (7B parameters or less, like most diffusion-based image generation models)
- Key features:
- Cost-efficient GPU, but still very capable
- L4 has the same amount of VRAM as A10, but only half the memory bandwidth
- L4 is newer than the T4 and offers 2x-4x better performance
T4
- Best for: Inference for small models
- Key features:
- T4 is older and slower than L4
- Offered for free with Google Colab, so good for small-scale experimentation and prototyping. For example, you can start with T4s on Colab, and run the same code in prod on L4s or A10s.
Choosing the right GPU
When selecting a GPU for your machine learning workload, first gather the following information:
- Task Type: Are you training, fine-tuning, or running inference?
- Model Size: How many parameters does your model have?
- Memory Requirements: How much VRAM does your model need?
- Budget: What’s your cost constraint per hour of computation?
- Performance Needs: Do you require the absolute fastest processing times?
Then follow this procedure to decide which GPU is the best fit:
1. Calculate the amount of memory that you need, depending on your use case and model size. Remember to take into account whether you are quantizing the model and/or using techniques like LoRA or QLoRA. You can refer to our VRAM guides for more information on how to calculate the memory requirements; a rough sizing sketch also follows this list.
2. Check against the table above for the most cost-effective GPU that the model will fit on.
3. Start with the most cost-effective GPU to see whether the model runs and performs well, and move to more expensive ones if it doesn’t.
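As a starting point for step 1, here is a minimal sizing sketch. It covers inference only, and the ~20% overhead factor is an assumption meant to account for the KV cache, activations, and CUDA runtime; training needs substantially more memory for gradients and optimizer state.

```python
# Rough VRAM estimate for inference: the weights dominate, so
# VRAM ≈ parameter count × bytes per parameter, plus headroom for the
# KV cache, activations, and CUDA overhead (the 1.2 factor is a guess).
def estimate_inference_vram_gib(n_params: float, bits_per_param: int = 16,
                                overhead: float = 1.2) -> float:
    weight_bytes = n_params * bits_per_param / 8
    return weight_bytes * overhead / 2**30

for bits in (16, 8, 4):  # fp16, int8, int4 quantization
    gib = estimate_inference_vram_gib(7e9, bits)
    print(f"7B model at {bits}-bit: ~{gib:.0f} GiB")
# Roughly 16, 8, and 4 GiB: an A10 or L4 handles the fp16 case,
# and a quantized 7B model fits comfortably on a T4.
```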
Advanced considerations
- Multi-GPU Setups: For very large models (greater than 100B parameters, like Llama3-405B), a single GPU, even a top-tier one, may not be enough, and you may need to allocate several. Modal’s platform makes it easy to scale up your GPU resources as needed, as in the sketch below.
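For example, here is a minimal sketch of requesting multiple GPUs for one function on Modal. It assumes the current gpu-string syntax where a `:<count>` suffix requests that many GPUs per container (check the Modal docs for your client version), and your serving code still has to shard the model across the GPUs, e.g. via tensor parallelism in your inference framework.

```python
import modal

app = modal.App("multi-gpu-inference")

# Eight H100s in a single container; the model-loading code inside the
# function is responsible for sharding the weights across them.
@app.function(gpu="H100:8")
def serve_very_large_model(prompt: str) -> str:
    # load a 100B+ parameter model with tensor parallelism here
    ...
```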
Conclusion
At Modal, we offer flexible access to all these GPU types with a simple `gpu="A100"` or `gpu="H100"` flag in your code. This allows you to easily switch between GPUs based on your needs without worrying about hardware procurement or maintenance.
Ready to supercharge your AI workloads with the right GPU? Sign up for Modal today and experience the difference firsthand!