Modal gives you the power and economy of open-source LLMs with the ease of serverless.
With Modal, you no longer have to choose between ease of use and the latest developments in language model research: you can have both!
All state-of-the-art LLM serving frameworks work out of the box, including vLLM and Text Generation Inference (TGI).
Modal helps you squeeze the last bit of utilization out of your GPUs. If your LLM framework supports continuous batching for greater token throughput, you can enable it with a single config change and reap the benefits immediately.
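As a hedged sketch of what that single config change can look like (assuming vLLM as the serving framework; the parameter values here are illustrative, not recommendations), continuous batching is governed by engine arguments such as `max_num_seqs`:

```python
# Illustrative sketch: tuning vLLM's continuous batching behavior.
# `max_num_seqs` and `max_num_batched_tokens` are real vLLM engine
# arguments, but the specific values are placeholders.
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="mistralai/Mistral-7B-Instruct-v0.1",
        max_num_seqs=64,              # max requests batched together per step
        max_num_batched_tokens=8192,  # token budget per scheduler iteration
    )
)
```

The scheduler then admits new requests into the running batch as earlier ones finish, instead of waiting for the whole batch to complete.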
To implement token streaming for your language model, all you have to do is make your regular Python function a generator. This magically works with HTTPS endpoints, so you can subscribe to the stream directly from your Node.js backend!
@method()
def generate(self, prompt: str):
    # Yielding makes this a generator, so tokens stream back to the caller.
    for output in pipeline(
        self.model,
        self.tokenizer,
        {"prompt": prompt},
    ):
        yield output
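Stripped of the Modal decorator, the underlying mechanism is plain Python generators. A minimal, framework-free sketch (the pipeline and its tokens are stand-ins, not a real model):

```python
from typing import Iterator

def fake_pipeline(prompt: str) -> Iterator[str]:
    # Stand-in for a real LLM pipeline: emits one token at a time.
    for token in ["Hello", ",", " world", "!"]:
        yield token

def generate(prompt: str) -> Iterator[str]:
    # Because this function yields, callers receive each token as it is
    # produced instead of waiting for the full completion.
    for output in fake_pipeline(prompt):
        yield output

if __name__ == "__main__":
    for chunk in generate("greet me"):
        print(chunk, end="", flush=True)
    print()
```

Consumers iterate over the stream exactly as they would over any other iterable, which is what lets the same function back an HTTP streaming response.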
Low-rank adaptation (LoRA) is a technique that makes it possible to fine-tune models in the form of small adapters that can be applied on top of the original model.
Modal’s parametrized functions make it trivial to build applications that perform inference across a dynamic set of LoRA adapters. Now you can fine-tune your models on demand, store the adapters in Volumes, and have them immediately ready for inference.
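As a hedged sketch of that pattern (the class, helper functions, and adapter layout are hypothetical, and applying the adapter is assumed to go through a library such as `peft`), a parametrized Modal class can key each container on an adapter name stored in a Volume:

```python
import modal

app = modal.App("lora-inference")
adapters = modal.Volume.from_name("lora-adapters")

@app.cls(gpu="A10G", volumes={"/adapters": adapters})
class Model:
    # Each distinct adapter_name gets its own warm container pool.
    adapter_name: str = modal.parameter()

    @modal.enter()
    def load(self):
        # Hypothetical helpers: load a shared base model, then apply the
        # requested LoRA adapter from the Volume (e.g. via peft).
        self.model = load_base_model()
        self.model = apply_adapter(self.model, f"/adapters/{self.adapter_name}")

    @modal.method()
    def generate(self, prompt: str) -> str:
        return run_inference(self.model, prompt)  # hypothetical helper
```

One deployment then serves an arbitrary fleet of fine-tunes: callers pick an adapter by instantiating `Model(adapter_name="...")`, and Modal routes the request to a container that has that adapter loaded.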
Deploy Llama 3 using Text Generation Inference with continuous batching and PagedAttention.
Deploy Mistral 7B with vLLM for fast inference.
Use retrieval-augmented generation (RAG) to turn any document into a dynamic knowledge base.
Build a complete chat app with Whisper transcription, Vicuna LLM and Tortoise text-to-speech.
“Ramp uses Modal to run some of our most data-intensive projects. Our team loves the developer experience because it allows them to be more productive and move faster. Without Modal, these projects would have been impossible for us to launch. Modal's user-friendly interface and efficient tools have truly empowered our team to navigate data-intensive tasks with ease, enabling us to achieve our project goals more efficiently.”