Language model inference

Modal gives you the power and economy of open-source LLMs with the ease of serverless.

Get started
Bleeding-edge engines

With Modal, you no longer have to choose between ease of use and the latest developments in language model research—you can have both!

All state-of-the-art LLM serving frameworks work out of the box, including:

  • vLLM
  • text-generation-inference
  • MLC
  • CTranslate2
Maximum utilization

Modal helps you squeeze every last bit of utilization out of your GPUs. If your LLM framework supports continuous batching for greater token throughput, you can take advantage of it with a single config change.
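The serving framework handles continuous batching for you, but the idea is easy to sketch in plain Python: instead of waiting for an entire batch to finish before starting new work, the scheduler admits waiting requests into the running batch as soon as a slot frees up. The toy scheduler below is purely illustrative (the `Request` class and token counts are invented for this sketch, not part of any framework):

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    tokens_left: int  # decode steps remaining for this request
    started_at: int = -1  # scheduler step at which the request was admitted


def continuous_batching(requests, max_batch_size):
    """Toy scheduler: each step decodes one token per active request and
    backfills freed slots from the queue immediately."""
    queue = list(requests)
    active, finished, step = [], [], 0
    while queue or active:
        # Admit new requests as soon as there is room in the batch.
        while queue and len(active) < max_batch_size:
            req = queue.pop(0)
            req.started_at = step
            active.append(req)
        # One decode step for every active request.
        for req in active:
            req.tokens_left -= 1
        # Retire completed requests, freeing their slots for the next step.
        finished += [r for r in active if r.tokens_left == 0]
        active = [r for r in active if r.tokens_left > 0]
        step += 1
    return step, finished
```

With static batching, a new request would wait until the whole current batch drained; here a short request finishing early lets the next one start immediately, which is where the throughput gain comes from.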

Token streaming

To implement token streaming for your language model, all you have to do is make your regular Python function a generator. This magically works with HTTPS endpoints, so you can subscribe to the stream directly from your Node.js backend!

def generate(self, prompt: str):
    for output in pipeline({"prompt": prompt}):
        yield output
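The reason a generator is all you need: each token is handed to the caller the moment it is yielded, rather than after the full completion is ready. Here is a self-contained sketch of that behavior, where `fake_pipeline` is a hypothetical stand-in for a real LLM pipeline that streams tokens from its decode loop:

```python
def fake_pipeline(inputs):
    """Stand-in for a real LLM pipeline: emits one token at a time.
    (Hypothetical; a real framework yields tokens as they are decoded.)"""
    for token in ["Modal", " makes", " streaming", " easy"]:
        yield token


def generate(prompt: str):
    # Because this is a generator, each output is available to the consumer
    # as soon as it is produced -- in a web endpoint, each chunk is flushed
    # to the client immediately instead of being buffered.
    for output in fake_pipeline({"prompt": prompt}):
        yield output
```

Calling `next()` on the generator returns the first token before the rest have even been produced, which is exactly the property a streaming HTTP response needs.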
LoRA made easy

Low-rank adaptation (LoRA) is a technique that makes it possible to create fine-tuned models in the form of small adapters that can be applied on top of the original model.

Modal’s parametrized functions make it trivial to build applications where you perform inference for a dynamic set of LoRA adapters. Now you can fine-tune your models on-demand, store the adapters in Volumes and immediately have them ready to go for inference.
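The adapter math is compact enough to sketch: a LoRA adapter stores two low-rank matrices A (r × in_features) and B (out_features × r), and inference patches the frozen base weight as W + scale · (B @ A). A toy pure-Python version, assuming nested lists for matrices (real serving code does this on GPU tensors):

```python
def matmul(a, b):
    """Multiply two matrices represented as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_lora(W, A, B, scale=1.0):
    """Return W + scale * (B @ A): the base weight patched by a rank-r adapter.

    A is r x in_features and B is out_features x r, so B @ A has W's shape.
    The base weight W is left untouched, which is why many adapters can be
    swapped in and out over the same frozen model.
    """
    delta = matmul(B, A)
    return [
        [W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
        for i in range(len(W))
    ]
```

Because the adapter only stores the two small matrices, a rank-8 adapter for a 4096 × 4096 weight holds 2 × 8 × 4096 numbers instead of 4096², which is what makes storing many per-user adapters in Volumes practical.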

Try it out

Karim Atiyeh
Co-Founder & CTO

Ramp uses Modal to run some of our most data-intensive projects. Our team loves the developer experience because it allows them to be more productive and move faster. Without Modal, these projects would have been impossible for us to launch. Modal's user-friendly interface and efficient tools have truly empowered our team to navigate data-intensive tasks with ease, enabling us to achieve our project goals more efficiently.

Ship your first app in minutes

with $30 / month free compute