Run open-source LLMs with Ollama on Modal
Ollama is a popular tool for running open-source large language models (LLMs) locally. It provides a simple API, including an OpenAI-compatible endpoint, for interacting with models such as Llama, Mistral, and Phi.
In this example, we demonstrate how to run Ollama on Modal’s cloud infrastructure, leveraging:
- Modal’s powerful GPU resources that far exceed what’s available on most local machines
- Serverless design that scales to zero when not in use (saving costs)
- Persistent model storage using Modal Volumes
- Web-accessible endpoints that expose Ollama’s OpenAI-compatible API
Since the Ollama server provides its own REST API, we use Modal’s web_server decorator to expose these endpoints directly to the internet.
Configuration and Constants
We define a handful of constants up front:
- The directory for Ollama models within the container and Volume
- The models we want to work with, specified in the "model:tag" format
- The Ollama version to install; you may need to update this for newer models
- Ollama's default port, which we'll expose through Modal
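A minimal sketch of these constants follows. The directory path, model list, and version pin are illustrative placeholders rather than the example's exact values; 11434 is Ollama's documented default port.

```python
MODELS_DIR = "/ollama_models"  # where Ollama stores model weights (mounted as a Volume below)
MODELS = ["llama3.2:1b", "mistral:7b"]  # "model:tag" identifiers to pull and serve; swap in your own
OLLAMA_VERSION = "0.6.2"  # pin the Ollama release installed into the image; bump for newer models
OLLAMA_PORT = 11434  # Ollama's default port, exposed through Modal
```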
Building the Container Image
First, we create a Modal Image that includes Ollama and its dependencies. We use the official Ollama installation script to set up the Ollama binary.
Create a Modal App, which groups our functions together
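Here is a sketch of what the image and App definition can look like, assuming the constants above. The base image choice and extra packages are illustrative, not the example's exact configuration.

```python
import modal

ollama_image = (
    modal.Image.debian_slim(python_version="3.12")
    .apt_install("curl", "ca-certificates")  # curl is needed to fetch the install script
    .run_commands(
        # The official install script respects the OLLAMA_VERSION environment variable.
        f"curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION={OLLAMA_VERSION} sh"
    )
    .env({"OLLAMA_MODELS": MODELS_DIR})  # tell Ollama to store model weights on the Volume mount
    .pip_install("httpx")  # used below for health checks and test requests
)

app = modal.App("ollama-server", image=ollama_image)
```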
Persistent Storage for Models
We use a Modal Volume to cache downloaded models between runs. This prevents needing to re-download large model files each time.
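For instance (the Volume name here is an arbitrary choice):

```python
model_volume = modal.Volume.from_name("ollama-models", create_if_missing=True)
```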
The Ollama Server Class
We define an OllamaServer class to manage the Ollama process; a sketch follows the list below. This class handles:
- Starting the Ollama server
- Downloading required models
- Exposing the API via Modal’s web_server
- Running test requests against the served models
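The sketch below shows one way to structure this class, assuming the constants, image, App, and Volume defined above. The GPU type, timeouts, and the test method are illustrative choices, not the example's exact code.

```python
import subprocess
import time

import httpx

MINUTES = 60  # seconds


@app.cls(
    gpu="A10G",  # illustrative GPU choice
    volumes={MODELS_DIR: model_volume},
    scaledown_window=5 * MINUTES,  # scale to zero after five idle minutes
)
class OllamaServer:
    @modal.enter()
    def start_ollama(self):
        # Launch the Ollama server in the background and wait until it responds.
        self.proc = subprocess.Popen(["ollama", "serve"])
        for _ in range(60):
            try:
                httpx.get(f"http://localhost:{OLLAMA_PORT}", timeout=1)
                break
            except httpx.HTTPError:
                time.sleep(1)
        # Pull each configured model; the weights land on the mounted Volume.
        for model in MODELS:
            subprocess.run(["ollama", "pull", model], check=True)
        model_volume.commit()  # persist freshly downloaded weights

    @modal.web_server(port=OLLAMA_PORT, startup_timeout=10 * MINUTES)
    def serve(self):
        # Ollama's own REST and OpenAI-compatible API is exposed directly at this port.
        pass

    @modal.method()
    def test(self, prompt: str, model: str) -> str:
        # Send a chat request to the local OpenAI-compatible endpoint and return the reply.
        resp = httpx.post(
            f"http://localhost:{OLLAMA_PORT}/v1/chat/completions",
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    @modal.exit()
    def stop_ollama(self):
        self.proc.terminate()
```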
Running the Example
This local entrypoint function provides a simple way to test the Ollama server; a sketch follows the list below.
When you run modal run ollama.py, this function will:
- Start an OllamaServer instance in the cloud
- Run test prompts against each configured model
- Print a summary of the results
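Under the assumptions of the class sketch above (in particular the hypothetical test method and prompt), the entrypoint might look like this:

```python
@app.local_entrypoint()
def main():
    server = OllamaServer()  # spins up a container in the cloud on first remote call
    results = {}
    for model in MODELS:
        results[model] = server.test.remote("Why is the sky blue?", model)
    # Print a short summary of each model's answer.
    for model, answer in results.items():
        print(f"--- {model} ---\n{answer}\n")
```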
Deploying to Production
While the local entrypoint is great for testing, for production use you’ll want to deploy this application persistently. You can do this with:
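```shell
modal deploy ollama.py
```

(assuming the file is saved as ollama.py, matching the modal run command above)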
This creates a persistent deployment that:
- Provides a stable URL endpoint for your Ollama API
- Keeps at least one container warm for fast responses
- Scales automatically based on usage
- Preserves your models in the persistent volume between invocations
After deployment, you can find your endpoint URL in your Modal dashboard.
You can then use this endpoint with any OpenAI-compatible client by setting:
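For example, using the official openai Python package; the base_url below is a placeholder for your actual deployment URL, and the model name is just one of the models configured earlier:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace--your-app-endpoint.modal.run/v1",  # placeholder: copy the real URL from the Modal dashboard
    api_key="not-needed",  # Ollama does not check API keys by default, but the client requires a value
)

response = client.chat.completions.create(
    model="llama3.2:1b",  # one of the models pulled above (illustrative)
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(response.choices[0].message.content)
```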