Run open-source LLMs with Ollama on Modal

Ollama is a popular tool for running open-source large language models (LLMs) locally. It provides a simple REST API, including OpenAI-compatible endpoints, for interacting with models like Llama, Mistral, Phi, and more.

In this example, we demonstrate how to run Ollama on Modal’s cloud infrastructure, leveraging:

  1. Modal’s powerful GPU resources that far exceed what’s available on most local machines
  2. Serverless design that scales to zero when not in use (saving costs)
  3. Persistent model storage using Modal Volumes
  4. Web-accessible endpoints that expose Ollama’s OpenAI-compatible API

Since the Ollama server provides its own REST API, we use Modal’s web_server decorator to expose these endpoints directly to the internet.

Configuration and Constants 

Directory for Ollama models within the container and volume

Define the models we want to work with. You can specify different model versions using the format “model:tag”

Ollama version to install - you may need to update this for the latest models

Ollama’s default port - we’ll expose this through Modal
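A minimal sketch of what these constants might look like. The directory path, model list, and version pin are illustrative assumptions; 11434 is Ollama’s documented default port.

```python
MODEL_DIR = "/models"  # where Ollama stores model weights inside the container / Volume
MODELS = ["llama3.2:3b", "mistral:7b"]  # models to serve, in "model:tag" format (illustrative)
OLLAMA_VERSION = "0.6.2"  # Ollama version to install; bump this for newer models (illustrative)
OLLAMA_PORT = 11434  # Ollama's default port, exposed through Modal
```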

Building the Container Image 

First, we create a Modal Image that includes Ollama and its dependencies. We use the official Ollama installation script to set up the Ollama binary.

Create a Modal App, which groups our functions together
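One way the image and app could be set up, assuming the constants above. The exact build steps are a sketch, not the definitive recipe; the install script accepts an OLLAMA_VERSION environment variable to pin a release, and the OLLAMA_MODELS environment variable tells Ollama where to store weights.

```python
import modal

image = (
    modal.Image.debian_slim(python_version="3.12")
    .apt_install("curl")
    .run_commands(
        # The official install script places the `ollama` binary on the PATH;
        # OLLAMA_VERSION pins the release to install.
        f"curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION={OLLAMA_VERSION} sh",
    )
    .env({"OLLAMA_MODELS": MODEL_DIR})  # store model weights in our Volume mount
)

app = modal.App("ollama", image=image)
```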

Persistent Storage for Models 

We use a Modal Volume to cache downloaded models between runs, so large model files don’t have to be re-downloaded every time a new container starts.
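A sketch of the Volume setup; the name “ollama-models” is an arbitrary choice, and any name works as long as it stays consistent across deploys.

```python
# Create (or look up) a named Volume; it is mounted at MODEL_DIR on the server class below.
volume = modal.Volume.from_name("ollama-models", create_if_missing=True)
```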

The Ollama Server Class 

We define an OllamaServer class to manage the Ollama process (see the sketch after this list). This class handles:

  • Starting the Ollama server
  • Downloading required models
  • Exposing the API via Modal’s web_server
  • Running test requests against the served models
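Here is a rough sketch of what such a class could look like. The GPU type, scaledown window, startup timeout, and the test_prompt helper are illustrative assumptions rather than the example’s exact implementation; @modal.enter, @modal.web_server, and @modal.method are the Modal primitives it relies on.

```python
import json
import subprocess
import time
import urllib.request

@app.cls(
    gpu="A10G",                   # illustrative; choose a GPU that fits your models
    volumes={MODEL_DIR: volume},  # persist downloaded weights across runs
    scaledown_window=300,         # let idle containers scale to zero after 5 minutes
)
class OllamaServer:
    @modal.enter()
    def start_ollama(self):
        # Start the Ollama server in the background, then pull the configured models.
        subprocess.Popen(["ollama", "serve"])
        time.sleep(2)  # crude wait for the server to start accepting connections
        for model in MODELS:
            subprocess.run(["ollama", "pull", model], check=True)

    @modal.web_server(port=OLLAMA_PORT, startup_timeout=180)
    def serve(self):
        # Ollama is already listening on OLLAMA_PORT (started in `start_ollama`),
        # so Modal just proxies incoming requests to that port.
        pass

    @modal.method()
    def test_prompt(self, model: str, prompt: str) -> str:
        # Hypothetical helper: call the local Ollama REST API for a single completion.
        payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request(
            f"http://localhost:{OLLAMA_PORT}/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]
```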

Running the Example 

This local entrypoint function provides a simple way to test the Ollama server (a sketch follows the list below). When you run modal run ollama.py, this function will:

  1. Start an OllamaServer instance in the cloud
  2. Run test prompts against each configured model
  3. Print a summary of the results
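A sketch of what that entrypoint might look like, reusing the hypothetical test_prompt method from the class sketch above:

```python
@app.local_entrypoint()
def main():
    server = OllamaServer()
    for model in MODELS:
        print(f"=== {model} ===")
        # .remote() executes the request inside the cloud container running Ollama.
        print(server.test_prompt.remote(model, "Why is the sky blue?"))
```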

Deploying to Production 

While the local entrypoint is great for testing, for production use you’ll want to deploy this application persistently. You can do this with modal deploy ollama.py.

This creates a persistent deployment that:

  1. Provides a stable URL endpoint for your Ollama API
  2. Keeps at least one container warm for fast responses
  3. Scales automatically based on usage
  4. Preserves your models in the persistent volume between invocations

After deployment, you can find your endpoint URL in your Modal dashboard.

You can then use this endpoint with any OpenAI-compatible client by pointing the client’s base URL at your deployment (with the /v1 path Ollama uses for its OpenAI-compatible API) and supplying a placeholder API key.
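For instance, with the official openai Python package; the URL below is a made-up placeholder, so substitute the one shown in your Modal dashboard.

```python
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://your-workspace--ollama-ollamaserver-serve.modal.run/v1",  # hypothetical URL
    api_key="not-needed",  # Ollama ignores the key, but the client library requires one
)

response = client.chat.completions.create(
    model="llama3.2:3b",  # one of the models pulled above
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response.choices[0].message.content)
```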