What is Ollama?
Ollama is an open-source project that simplifies the process of running and managing large language models. It has a bunch of nice features:
- You can install multiple models and switch between them on the fly, without restarting a daemon.
- It comes with a powerful command-line interface that is easy to integrate into your workflows. You can run `ollama run <modelname> "Your request"` to quickly load a model and process your input.
- It provides access to a wide range of pre-configured models: simply running `ollama run <modelname>` will download and run the specified model if it’s not already available locally.
This guide will walk you through running Ollama on Modal, a serverless cloud computing platform, so that you can serve models on Modal’s on-demand GPUs. The full code for this guide is here.
Prerequisites
Before we begin, make sure you have the following:
- An account at modal.com
- The Modal Python package installed (`pip install modal`)
- The Modal CLI authenticated (run `modal setup`, or `python -m modal setup` if the former doesn’t work)
Running Ollama on Modal
To run Ollama on Modal:
- Clone the project directory containing the code.
- Open a terminal and navigate to the project directory.
- Run the following command:
modal run ollama-modal.py --text "Your question here"
This command will deploy the Ollama service on Modal and run an inference with your specified text.
Understanding the code
Service configuration
The `ollama.service` file contains a systemd service configuration for Ollama:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
[Install]
WantedBy=default.target
This configuration ensures that Ollama runs as a service, automatically starting after the network is online and restarting if it fails.
Main application code
The `ollama-modal.py` file contains the main application code for running Ollama on Modal. Let’s examine its key components:
- Importing necessary modules:
import modal
import os
import subprocess
import time
from modal import build, enter, method
These imports provide the required functionality for interacting with Modal and managing system processes.
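One thing the excerpt doesn’t show is the Modal app object that the decorators below attach to. Because the class is registered with `@app.cls(...)` and the entrypoint looks it up under the name "ollama", the full script presumably defines something like the following (the exact name and constructor are assumptions):

```python
# Assumed, not shown in the excerpt: the Modal app that @app.cls and the
# entrypoint attach to. The name "ollama" is inferred from
# modal.Cls.lookup("ollama", "Ollama") in main().
app = modal.App("ollama")  # older Modal versions used modal.Stub instead
```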
- Defining the model and pull function:
MODEL = os.environ.get("MODEL", "llama3:instruct")
def pull(model: str = MODEL):
    # ... (code for starting Ollama service and pulling the model)
This section sets up the default model and defines a function to start the Ollama service and pull the specified model.
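The body of `pull` is elided above. Based on the systemd unit and the image definition that follows, a minimal sketch of what it needs to do could look like this (the specific commands and the short wait are assumptions, not the author’s exact code):

```python
def pull(model: str = MODEL):
    # Sketch only: start the Ollama server via systemd, give it a moment to
    # come up, then download the model so run_function(pull) bakes the weights
    # into the image. Relies on the subprocess and time imports shown above.
    subprocess.run(["systemctl", "daemon-reload"], check=True)
    subprocess.run(["systemctl", "enable", "ollama"], check=True)
    subprocess.run(["systemctl", "start", "ollama"], check=True)
    time.sleep(2)  # assumed grace period before the server accepts requests
    subprocess.run(["ollama", "pull", model], check=True)
```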
- Creating the Modal image:
image = (
    modal.Image
    .debian_slim()
    .apt_install("curl", "systemctl")
    .run_commands(
        # ... (commands to install Ollama)
    )
    .copy_local_file("ollama.service", "/etc/systemd/system/ollama.service")
    .pip_install("ollama")
    .run_function(pull)
)
This code creates a Modal image with Ollama installed and configured.
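The elided `run_commands(...)` step is where the Ollama binary itself gets installed. Ollama’s documented installation method on Linux is its install script, so a plausible reconstruction of the image definition looks like this; treat the install command as an assumption rather than the author’s exact code:

```python
import modal

# Sketch: the same image definition with the elided install step filled in.
# The curl line uses Ollama's official install script, which also creates the
# `ollama` user and group referenced by ollama.service. `pull` is the function
# defined earlier in the script.
image = (
    modal.Image.debian_slim()
    .apt_install("curl", "systemctl")
    .run_commands("curl -fsSL https://ollama.com/install.sh | sh")
    .copy_local_file("ollama.service", "/etc/systemd/system/ollama.service")
    .pip_install("ollama")  # Python client used at inference time
    .run_function(pull)     # downloads the model weights at image-build time
)
```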
- Defining the Ollama class:
@app.cls(gpu="a10g", region="us-east", container_idle_timeout=300)
class Ollama:
    @build()
    def pull(self):
        # ... (build step, currently empty)

    @enter()
    def load(self):
        subprocess.run(["systemctl", "start", "ollama"])

    @method()
    def infer(self, text: str):
        # ... (code for inference using Ollama)
This class encapsulates the Ollama functionality, including starting the service and performing inference.
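The body of `infer` is also elided. Since `main()` consumes it with `remote_gen`, it has to be a generator that yields text chunks; a minimal sketch using the `ollama` Python client’s streaming chat API could look like this (the message format and the reuse of `MODEL` are assumptions):

```python
# Sketch of the infer method (it lives inside the Ollama class above). The
# `ollama` Python client is available because the image ran pip_install("ollama").
import ollama

@method()
def infer(self, text: str):
    # Stream the response from the local Ollama server and yield it chunk by
    # chunk, so callers can iterate over it with remote_gen().
    stream = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": text}],
        stream=True,
    )
    for chunk in stream:
        yield chunk["message"]["content"]
```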
- Main entrypoint:
@app.local_entrypoint()
def main(text: str = "Why is the sky blue?", lookup: bool = False):
    if lookup:
        # Reuse an already-deployed instance of the class
        ollama = modal.Cls.lookup("ollama", "Ollama")
    else:
        ollama = Ollama()
    # Stream the response and print it as it arrives
    for chunk in ollama.infer.remote_gen(text):
        print(chunk, end='', flush=True)
This function provides a convenient way to run the Ollama inference from the command line.
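If the app has already been deployed (for example with `modal deploy ollama-modal.py`), passing `--lookup` makes the entrypoint reuse the deployed `Ollama` class via `modal.Cls.lookup` instead of creating a fresh ephemeral app for the run. Note that `modal run` turns the entrypoint’s parameters into CLI options, which is why `--text "Your question here"` works in the command shown earlier.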