Run open-source LLMs with Ollama on Modal

Ollama is a popular tool for running open-source large language models (LLMs) locally. It provides a simple REST API, including OpenAI-compatible endpoints, for interacting with models like Llama, Mistral, Phi, and more.

In this example, we demonstrate how to run Ollama on Modal’s cloud infrastructure, leveraging:

  1. Modal’s powerful GPU resources that far exceed what’s available on most local machines
  2. Serverless design that scales to zero when not in use (saving costs)
  3. Persistent model storage using Modal Volumes
  4. Web-accessible endpoints that expose Ollama’s OpenAI-compatible API

Since the Ollama server provides its own REST API, we use Modal’s web_server decorator to expose these endpoints directly to the internet.

Configuration and Constants 

Directory for Ollama models within the container and volume

Define the models we want to work with. You can specify different model versions using the format “model:tag”

Ollama version to install - you may need to update this for the latest models

Ollama’s default port - we’ll expose this through Modal
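A minimal sketch of what these constants might look like. The directory path, model list, and version pin are illustrative assumptions; 11434 is Ollama’s documented default port.

```python
MODEL_DIR = "/models"  # where Ollama stores model weights inside the container / Volume
MODELS = ["llama3.2:3b", "mistral:7b"]  # models to serve, in "model:tag" format (illustrative)
OLLAMA_VERSION = "0.6.2"  # Ollama version to install; bump this for newer models (illustrative)
OLLAMA_PORT = 11434  # Ollama's default port, exposed through Modal
```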

Building the Container Image 

First, we create a Modal Image that includes Ollama and its dependencies. We use the official Ollama installation script to set up the Ollama binary.

Create a Modal App, which groups our functions together
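One way the image and app could be set up, assuming the constants above. The exact build steps are a sketch, not the definitive recipe; the install script accepts an OLLAMA_VERSION environment variable to pin a release, and the OLLAMA_MODELS environment variable tells Ollama where to store weights.

```python
import modal

image = (
    modal.Image.debian_slim(python_version="3.12")
    .apt_install("curl")
    .run_commands(
        # The official install script places the `ollama` binary on the PATH;
        # OLLAMA_VERSION pins the release to install.
        f"curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION={OLLAMA_VERSION} sh",
    )
    .env({"OLLAMA_MODELS": MODEL_DIR})  # store model weights in our Volume mount
)

app = modal.App("ollama", image=image)
```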

Persistent Storage for Models 

We use a Modal Volume to cache downloaded models between runs, so large model files don’t have to be re-downloaded every time a new container starts.
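A sketch of the Volume setup; the name “ollama-models” is an arbitrary choice, and any name works as long as it stays consistent across deploys.

```python
# Create (or look up) a named Volume; it is mounted at MODEL_DIR on the server class below.
volume = modal.Volume.from_name("ollama-models", create_if_missing=True)
```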

The Ollama Server Class 

We define an OllamaServer class to manage the Ollama process (see the sketch after this list). This class handles:

  • Starting the Ollama server
  • Downloading required models
  • Exposing the API via Modal’s web_server
  • Running test requests against the served models
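Here is a rough sketch of what such a class could look like. The GPU type, scaledown window, startup timeout, and the test_prompt helper are illustrative assumptions rather than the example’s exact implementation; @modal.enter, @modal.web_server, and @modal.method are the Modal primitives it relies on.

```python
import json
import subprocess
import time
import urllib.request

@app.cls(
    gpu="A10G",                   # illustrative; choose a GPU that fits your models
    volumes={MODEL_DIR: volume},  # persist downloaded weights across runs
    scaledown_window=300,         # let idle containers scale to zero after 5 minutes
)
class OllamaServer:
    @modal.enter()
    def start_ollama(self):
        # Start the Ollama server in the background, then pull the configured models.
        subprocess.Popen(["ollama", "serve"])
        time.sleep(2)  # crude wait for the server to start accepting connections
        for model in MODELS:
            subprocess.run(["ollama", "pull", model], check=True)

    @modal.web_server(port=OLLAMA_PORT, startup_timeout=180)
    def serve(self):
        # Ollama is already listening on OLLAMA_PORT (started in `start_ollama`),
        # so Modal just proxies incoming requests to that port.
        pass

    @modal.method()
    def test_prompt(self, model: str, prompt: str) -> str:
        # Hypothetical helper: call the local Ollama REST API for a single completion.
        payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
        req = urllib.request.Request(
            f"http://localhost:{OLLAMA_PORT}/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]
```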

Running the Example 

This local entrypoint function provides a simple way to test the Ollama server (a sketch follows the list below). When you run modal run ollama.py, this function will:

  1. Start an OllamaServer instance in the cloud
  2. Run test prompts against each configured model
  3. Print a summary of the results
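A sketch of what that entrypoint might look like, reusing the hypothetical test_prompt method from the class sketch above:

```python
@app.local_entrypoint()
def main():
    server = OllamaServer()
    for model in MODELS:
        print(f"=== {model} ===")
        # .remote() executes the request inside the cloud container running Ollama.
        print(server.test_prompt.remote(model, "Why is the sky blue?"))
```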

Deploying to Production 

While the local entrypoint is great for testing, for production use you’ll want to deploy this application persistently. You can do this with modal deploy ollama.py.

This creates a persistent deployment that:

  1. Provides a stable URL endpoint for your Ollama API
  2. Keeps at least one container warm for fast responses
  3. Scales automatically based on usage
  4. Preserves your models in the persistent volume between invocations

After deployment, you can find your endpoint URL in your Modal dashboard.

You can then use this endpoint with any OpenAI-compatible client by pointing the client’s base URL at your deployment (with the /v1 path Ollama uses for its OpenAI-compatible API) and supplying a placeholder API key.
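For instance, with the official openai Python package; the URL below is a made-up placeholder, so substitute the one shown in your Modal dashboard.

```python
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://your-workspace--ollama-ollamaserver-serve.modal.run/v1",  # hypothetical URL
    api_key="not-needed",  # Ollama ignores the key, but the client library requires one
)

response = client.chat.completions.create(
    model="llama3.2:3b",  # one of the models pulled above
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response.choices[0].message.content)
```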