September 15, 2024 · 5 minute read
How to run Ollama
Yiren Lu (@YirenLu), Solutions Engineer

What is Ollama?

Ollama is an open-source project that simplifies the process of running and managing large language models. It has a bunch of nice features:

  • Lets you install multiple models and switch between them on the fly, without requiring a daemon restart.
  • Comes with a powerful command-line interface, making it easy to integrate into your workflows. You can run commands like ollama run <modelname> "Your request" to quickly load a model and process your input.
  • Provides access to a wide range of pre-configured models. Simply running ollama run <modelname> will download and run the specified model if it’s not already available locally. (A Python equivalent is sketched just below.)
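
If you would rather drive Ollama from Python than from the shell, the official ollama Python client (which the Modal image in this guide installs via pip) mirrors these CLI commands. Here is a minimal sketch, assuming a local Ollama server is already running; the model name is just an example:

import ollama

# Download the model if it isn't present locally (equivalent to `ollama pull`).
ollama.pull("llama3:instruct")

# Stream a chat completion (equivalent to `ollama run llama3:instruct "..."`).
stream = ollama.chat(
    model="llama3:instruct",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)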

This guide will walk you through running Ollama on Modal, a serverless cloud computing platform, so you can serve models on Modal’s on-demand GPU resources. The full code for this guide is here.

Prerequisites

Before we begin, make sure you have the following:

  1. An account at modal.com
  2. The Modal Python package installed (pip install modal)
  3. Modal CLI authenticated (run modal setup or python -m modal setup if the former doesn’t work)

Running Ollama on Modal

To run Ollama on Modal:

  1. Clone the repository containing the code.
  2. Open a terminal and navigate to the project directory.
  3. Run the following command:
modal run ollama-modal.py --text "Your question here"

This command spins up the Ollama app on Modal and runs inference on the text you provide.

Understanding the code

Service configuration

The ollama.service file contains a systemd service configuration for Ollama:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target

This configuration ensures that Ollama runs as a service, automatically starting after the network is online and restarting if it fails.
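
Inside the container, it can help to confirm that the server is actually accepting requests before pulling a model or running inference. One way to do that, sketched here as a hypothetical helper (Ollama listens on port 11434 by default):

import time
import urllib.error
import urllib.request

def wait_for_ollama(timeout: float = 30.0) -> None:
    # Poll the local Ollama API until the server responds or the timeout expires.
    deadline = time.time() + timeout
    while True:
        try:
            urllib.request.urlopen("http://127.0.0.1:11434", timeout=2)
            return
        except urllib.error.URLError:
            if time.time() > deadline:
                raise TimeoutError("Ollama server did not come up in time")
            time.sleep(1)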

Main application code

The ollama-modal.py file contains the main application code for running Ollama on Modal. Let’s examine its key components:

  1. Importing necessary modules:
import modal
import os
import subprocess
import time
from modal import build, enter, method

These imports provide the required functionality for interacting with Modal and managing system processes.
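
One thing to note: the class decorator further down references an app object that is not shown in this excerpt. In a Modal script it is created once at module level; the name used here is an assumption, chosen to match the modal.Cls.lookup("ollama", ...) call in the entrypoint below:

# Assumed app definition; the decorators below (e.g. @app.cls) hang off this object.
app = modal.App(name="ollama")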

  2. Defining the model and pull function:
MODEL = os.environ.get("MODEL", "llama3:instruct")

def pull(model: str = MODEL):
    # ... (code for starting Ollama service and pulling the model)

This section sets up the default model and defines a function to start the Ollama service and pull the specified model.
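
The body of pull is elided above. Here is a minimal sketch of what such a function might do, assuming systemctl and the ollama binary are available inside the image (which the build steps below arrange):

def pull(model: str = MODEL):
    # Start the systemd-managed Ollama server inside the image build container,
    # then download the requested model weights so they get baked into the image.
    subprocess.run(["systemctl", "daemon-reload"], check=True)
    subprocess.run(["systemctl", "enable", "ollama"], check=True)
    subprocess.run(["systemctl", "start", "ollama"], check=True)
    time.sleep(2)  # crude wait; the polling helper sketched earlier is more robust
    subprocess.run(["ollama", "pull", model], check=True)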

  3. Creating the Modal image:
image = (
    modal.Image
    .debian_slim()
    .apt_install("curl", "systemctl")
    .run_commands(
        # ... (commands to install Ollama)
    )
    .copy_local_file("ollama.service", "/etc/systemd/system/ollama.service")
    .pip_install("ollama")
    .run_function(pull)
)

This code creates a Modal image with Ollama installed and configured. Because pull runs during the image build (via run_function), the model weights are downloaded once and baked into the image rather than fetched on every container start.

  4. Defining the Ollama class:
@app.cls(gpu="a10g", region="us-east", container_idle_timeout=300)
class Ollama:
    @build()
    def pull(self):
        # ... (build step, currently empty)

    @enter()
    def load(self):
        subprocess.run(["systemctl", "start", "ollama"])

    @method()
    def infer(self, text: str):
        # ... (code for inference using Ollama)

This class encapsulates the Ollama functionality, including starting the service and performing inference.
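
The inference body is elided as well. Here is a minimal sketch using the ollama Python client's streaming chat API (pip-installed into the image above), written as a generator so that remote_gen can stream chunks back to the caller:

    @method()
    def infer(self, text: str):
        import ollama

        # Stream the response and yield plain-text chunks back to the client.
        stream = ollama.chat(
            model=MODEL,
            messages=[{"role": "user", "content": text}],
            stream=True,
        )
        for chunk in stream:
            yield chunk["message"]["content"]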

  5. Main entrypoint:
@app.local_entrypoint()  # lets `modal run ollama-modal.py` invoke this function
def main(text: str = "Why is the sky blue?", lookup: bool = False):
    if lookup:
        ollama = modal.Cls.lookup("ollama", "Ollama")
    else:
        ollama = Ollama()
    for chunk in ollama.infer.remote_gen(text):
        print(chunk, end="", flush=True)  # flush so streamed chunks appear immediately

This function provides a convenient way to run Ollama inference from the command line. The lookup branch is for when the app has already been deployed with modal deploy: instead of spinning up an ephemeral app, it attaches to the deployed Ollama class by name.
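
A short sketch of what that looks like from any other Python script authenticated against Modal (assuming the app was deployed under the name "ollama"):

import modal

# Attach to the deployed app and call its Ollama class remotely.
ollama = modal.Cls.lookup("ollama", "Ollama")
for chunk in ollama.infer.remote_gen("Why is the sky blue?"):
    print(chunk, end="", flush=True)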
