Why use an inference framework?
Why can’t developers simply use a library like Transformers to serve their models?
While libraries like Transformers are excellent for training and basic inference, they have limitations when it comes to large-scale deployment and serving of LLMs:
Memory efficiency: LLMs require significant memory resources. General-purpose libraries may not optimize memory usage, leading to inefficient resource allocation. For more information about the VRAM requirements for serving LLMs, read here.
Inference speed: Standard libraries often lack optimizations specific to inference, resulting in slower processing times for large models.
Batching and queueing: Handling multiple requests efficiently requires sophisticated batching and queueing mechanisms, which are not typically included in training-focused libraries.
Scalability: Serving LLMs at scale requires careful management of computational resources, which is beyond the scope of most general-purpose libraries.
For most production model serving, where the goal is to maximize throughput and minimize latency, you should instead use a dedicated inference server. Two of the most popular inference servers for LLM use cases are vLLM and TGI.
What are vLLM and TGI?
vLLM
vLLM is an open-source library designed for fast LLM inference and serving. Developed by researchers at UC Berkeley, it is built around PagedAttention, an attention algorithm that manages attention key and value memory in fixed-size blocks, much like virtual memory paging in an operating system. vLLM reports up to 24x higher throughput than Hugging Face Transformers, without requiring any model architecture changes.
Key features of vLLM include:
- Efficient memory management
- Continuous batching
- Optimized kernel implementations
- Support for various model architectures
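As a minimal sketch of what using vLLM looks like, the snippet below runs offline batched inference with the library's `LLM` and `SamplingParams` classes. The prompts and the model name (`facebook/opt-125m`) are just placeholders; any supported Hugging Face causal LM would work.

```python
# Minimal sketch: offline batched inference with vLLM.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Large language models are",
]

# Sampling parameters applied to every prompt in the batch.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Example model; swap in any model vLLM supports.
llm = LLM(model="facebook/opt-125m")

# vLLM batches the prompts internally and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```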
TGI (Text Generation Inference)
TGI, short for Text Generation Inference, is a toolkit developed by Hugging Face for deploying and serving Large Language Models (LLMs). It enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more, and focuses on providing a production-ready serving solution with a particular emphasis on text generation tasks.
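Once a TGI server is running (typically via the official Docker container), you talk to it over a simple REST API. The sketch below assumes a container is already up and mapped to `http://localhost:8080`; the port, prompt, and parameters are illustrative assumptions, not requirements.

```python
# Minimal sketch: querying a running TGI server over its REST API.
# Assumes the official TGI Docker container is already serving a model
# and is reachable at http://localhost:8080 (port mapping is an assumption).
import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Text Generation Inference?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```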
Performance comparison: Which one is faster?
When it comes to performance, both vLLM and TGI offer significant improvements over baseline implementations. However, determining which one is faster is not straightforward, as performance can vary depending on the specific use case, model architecture, and hardware configuration.
Throughput: vLLM often demonstrates higher throughput, especially for larger batch sizes, due to its PagedAttention mechanism and continuous batching optimizations.
Memory efficiency: vLLM’s PagedAttention technique allows for more efficient memory usage, potentially enabling higher concurrency on the same hardware.
Ease of use: Since TGI is made by Hugging Face, serving any Hugging Face model (including private or gated ones) with TGI is relatively straightforward. The default way of running TGI, via the official Docker container, also brings up an API endpoint out of the box.
Production-readiness: TGI offers built-in telemetry via OpenTelemetry and Prometheus metrics. vLLM ships with fewer of these “production-ready” bells and whistles.
We would generally recommend using vLLM, which provides a nice balance of speed, support for distributed inference (needed for large models), and ease of installation.
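To try vLLM locally before committing to a deployment setup, you can start its OpenAI-compatible server (for example with something like `vllm serve <model>`) and query it with the standard `openai` client. The model name and port below are assumptions; vLLM listens on port 8000 by default and does not require an API key.

```python
# Minimal sketch: querying a vLLM OpenAI-compatible server.
# Assumes a server was started with something like:
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct
# and is listening on the default port 8000. The model name is an example.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```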
How to deploy models with vLLM
Modal is a serverless cloud computing platform that makes it easy for you to deploy open-source models with frameworks like vLLM and TGI. To get started, follow these tutorials: