# Modal llms-full.txt > Modal is a platform for running Python code in the cloud with minimal > configuration, especially for serving AI models and high-performance batch > processing. It supports fast prototyping, serverless APIs, scheduled jobs, > GPU inference, distributed volumes, and sandboxes. Important notes: - Modal's primitives are embedded in Python and tailored for AI/GPU use cases, but they can be used for general-purpose cloud compute. - Modal is a serverless platform, meaning you are only billed for resources used and can spin up containers on demand in seconds. You can sign up for free at [https://modal.com] and get $30/month of credits. ## Guides ### Custom container images #### Defining Images # Images This guide walks you through how to define the environment your Modal Functions run in. These environments are called _containers_. Containers are like light-weight virtual machines -- container engines use [operating system tricks](https://earthly.dev/blog/chroot/) to isolate programs from each other ("containing" them), making them work as though they were running on their own hardware with their own filesystem. This makes execution environments more reproducible, for example by preventing accidental cross-contamination of environments on the same machine. For added security, Modal runs containers using the sandboxed [gVisor container runtime](https://cloud.google.com/blog/products/identity-security/open-sourcing-gvisor-a-sandboxed-container-runtime). Containers are started up from a stored "snapshot" of their filesystem state called an _image_. Producing the image for a container is called _building_ the image. By default, Modal Functions are executed in a [Debian Linux](https://en.wikipedia.org/wiki/Debian) container with a basic Python installation of the same minor version `v3.x` as your local Python interpreter. To make your Apps and Functions useful, you will probably need some third party system packages or Python libraries. Modal provides a number of options to customize your container images at different levels of abstraction and granularity, from high-level convenience methods like `pip_install` through wrappers of core container image build features like `RUN` and `ENV` to full on "bring-your-own-Dockerfile". We'll cover each of these in this guide, along with tips and tricks for building Images effectively when using each tool. The typical flow for defining an image in Modal is [method chaining](https://jugad2.blogspot.com/2016/02/examples-of-method-chaining-in-python.html) starting from a base image, like this: ```python import modal image = ( modal.Image.debian_slim(python_version="3.10") .apt_install("git") .pip_install("torch==2.6.0") .env({"HALT_AND_CATCH_FIRE": "0"}) .run_commands("git clone https://github.com/modal-labs/agi && echo 'ready to go!'") ) ``` In addition to being Pythonic and clean, this also matches the onion-like [layerwise build process](https://docs.docker.com/build/guide/layers/) of container images. ## Adding Python packages The simplest and most common container modification is to add some third party Python package, like [`pandas`](https://pandas.pydata.org/). You can add Python packages to the environment by passing all the packages you need to the [`Image.uv_pip_install`](https://modal.com/docs/reference/modal.Image#uv_pip_install) method. 
The `Image.uv_pip_install` method takes care of some nuances that are important for using `uv` in containerized workflows, like generating bytecode files during the build phase so that cold starts are faster. You can include [typical Python dependency version specifiers](https://peps.python.org/pep-0508/), like `"torch <= 2.0"`, in the arguments. But we recommend pinning dependencies tightly, like `"torch == 1.9.1"`, to improve the reproducibility and robustness of your builds. ```python import modal datascience_image = ( modal.Image.debian_slim(python_version="3.10") .uv_pip_install("pandas==2.2.0", "numpy") ) @app.function(image=datascience_image) def my_function(): import pandas as pd import numpy as np df = pd.DataFrame() ... ``` If you run into any issues with [`Image.uv_pip_install`](https://modal.com/docs/reference/modal.Image#uv_pip_install), then you can fallback to [`Image.pip_install`](https://modal.com/docs/reference/modal.Image#pip_install) which uses standard `pip`: ```python import modal datascience_image = ( modal.Image.debian_slim(python_version="3.10") .pip_install("pandas==2.2.0", "numpy") ) ``` Note that because you can define a different environment for each and every Modal Function if you so choose, you don't need to worry about virtual environment management. Containers make for much better separation of concerns! If you want to run a specific version of Python remotely rather than just matching the one you're running locally, provide the `python_version` as a string when constructing the base image, like we did above. ## Add local files with `add_local_dir` and `add_local_file` If you want to forward files from your local system, you can do that using the `image.add_local_dir` and `image.add_local_file` image builder methods. ```python image = modal.Image.debian_slim().add_local_dir("/user/erikbern/.aws", remote_path="/root/.aws") ``` By default, these files are added to your container as it starts up rather than introducing a new image layer. This means that the redeployment after making changes is really quick, but also means you can't run additional build steps after. You can specify a `copy=True` argument to the `add_local_` methods to instead force the files to be included in a built image. ### Adding local Python modules There is a convenience method for the special case of adding local Python modules to the container: [`Image.add_local_python_source`](https://modal.com/docs/reference/modal.Image#add_local_python_source) The difference from `add_local_dir` is that `add_local_python_source` takes module names as arguments instead of a file system path and looks up the local package's or module's location via Python's importing mechanism. The files are then added to directories that make them importable in containers in the same way as they are locally. This is mostly intended for pure Python auxiliary modules that are part of your project and that your code imports, whereas third party packages should be installed via [`Image.pip_install()`](https://modal.com/docs/reference/modal.Image#pip_install) or similar. ```python import modal app = modal.App() image_with_module = modal.Image.debian_slim().add_local_python_source("my_local_module") @app.function(image=image_with_module) def f(): import my_local_module # this will now work in containers my_local_module.do_stuff() ``` ### What if I have different Python packages locally and remotely? You might want to use packages inside your Modal code that you don't have on your local computer. 
In the example above, we build a container that uses `pandas`. But if we don't have `pandas` locally, on the computer launching the Modal job, we can't put `import pandas` at the top of the script, since it would cause an `ImportError`.

The easiest solution to this is to put `import pandas` in the function body instead, as you can see above. This means that `pandas` is only imported when running inside the remote Modal container, which has `pandas` installed.

Be careful about what you return from Modal Functions that have different packages installed than the ones you have locally! Modal Functions return Python objects, like `pandas.DataFrame`s, and if your local machine doesn't have `pandas` installed, it won't be able to handle a `pandas` object (the error message you see will mention [serialization](https://hazelcast.com/glossary/serialization/)/[deserialization](https://hazelcast.com/glossary/deserialization/)).

If you have a lot of functions and a lot of Python packages, you might want to keep the imports in the global scope so that every function can use the same imports. In that case, you can use the [`imports()`](https://modal.com/docs/reference/modal.Image#imports) context manager:

```python
import modal

pandas_image = modal.Image.debian_slim().pip_install("pandas", "numpy")

with pandas_image.imports():
    import pandas as pd
    import numpy as np

@app.function(image=pandas_image)
def my_function():
    df = pd.DataFrame()
```

Because these imports happen before a new container processes its first input, you can combine this context manager with [memory snapshots](https://modal.com/docs/guide/memory-snapshot) to improve [cold start performance](https://modal.com/docs/guide/cold-start#share-initialization-work-across-cold-starts-with-memory-snapshots) for Functions that frequently scale from zero.

## Run shell commands with `.run_commands`

You can also supply shell commands that should be executed when building the container image. You might use this to preload custom assets, like model parameters, so that they don't need to be retrieved when Functions start up:

```python
import modal

image_with_model = (
    modal.Image.debian_slim().apt_install("curl").run_commands(
        "curl -O https://raw.githubusercontent.com/opencv/opencv/master/data/haarcascades/haarcascade_frontalcatface.xml",
    )
)

@app.function(image=image_with_model)
def find_cats():
    content = open("/haarcascade_frontalcatface.xml").read()
    ...
```

## Run a Python function during your build with `.run_function`

Instead of using shell commands, you can also run a Python function as an image build step using the [`Image.run_function`](https://modal.com/docs/reference/modal.Image#run_function) method. For example, you can use this to download model parameters from Hugging Face into your Image:

```python
import os

import modal

def download_models() -> None:
    import diffusers

    model_name = "segmind/small-sd"
    pipe = diffusers.StableDiffusionPipeline.from_pretrained(
        model_name, use_auth_token=os.environ["HF_TOKEN"]
    )
    pipe.save_pretrained("/model")

image = (
    modal.Image.debian_slim()
    .pip_install("diffusers[torch]", "transformers", "ftfy", "accelerate")
    .run_function(download_models, secrets=[modal.Secret.from_name("huggingface-secret")])
)
```

Any kwargs accepted by [`@app.function`](https://modal.com/docs/reference/modal.App#function) ([`Volume`s](https://modal.com/docs/guide/volumes), and specifications of resources like [GPUs](https://modal.com/docs/guide/gpu)) can be supplied here.
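For instance, here is a hedged sketch of supplying such kwargs to `.run_function`, attaching a GPU and a Volume to the build step above (the volume name `model-build-cache`, the mount path, and the GPU choice are illustrative, not from the original docs):

```python
import modal

# Illustrative only: a cache Volume and a GPU attached to the build-step function above.
cache_volume = modal.Volume.from_name("model-build-cache", create_if_missing=True)

image = (
    modal.Image.debian_slim()
    .pip_install("diffusers[torch]", "transformers", "ftfy", "accelerate")
    .run_function(
        download_models,  # the build-step function defined in the previous example
        gpu="A10G",  # run this build step on a GPU instance
        volumes={"/cache": cache_volume},
        secrets=[modal.Secret.from_name("huggingface-secret")],
    )
)
```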
Essentially, this is equivalent to running a Modal Function and snapshotting the resulting filesystem as an image.

Whenever you change other features of your image, like the base image or the version of a Python package, the image will automatically be rebuilt the next time it is used. This is a bit more complicated when changing the contents of functions. See the [reference documentation](https://modal.com/docs/reference/modal.Image#run_function) for details.

## Attach GPUs during setup

If a step in the setup of your container image should be run on an instance with a GPU (e.g., so that a package can query the GPU to set compilation flags), pass a desired GPU type when defining that step:

```python
import modal

image = (
    modal.Image.debian_slim()
    .pip_install("bitsandbytes", gpu="H100")
)
```

## Use `mamba` instead of `pip` with `micromamba_install`

`pip` installs Python packages, but some Python workloads require the coordinated installation of system packages as well. The `mamba` package manager can install both. Modal provides a pre-built [Micromamba](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html) base image that makes it easy to work with `micromamba`:

```python
import modal

app = modal.App("bayes-pgm")

numpyro_pymc_image = (
    modal.Image.micromamba()
    .micromamba_install("pymc==5.10.4", "numpyro==0.13.2", channels=["conda-forge"])
)

@app.function(image=numpyro_pymc_image)
def sample():
    import pymc as pm
    import numpyro as np

    print(f"Running on PyMC v{pm.__version__} with JAX/numpyro v{np.__version__} backend")
    ...
```

## Use an existing container image with `.from_registry`

You don't always need to start from scratch! Public registries like [Docker Hub](https://hub.docker.com/) have many pre-built container images for common software packages.

You can use any public image in your function using [`Image.from_registry`](https://modal.com/docs/reference/modal.Image#from_registry), so long as:

- Python 3.9 or later is installed on the `$PATH` as `python`
- `pip` is installed correctly
- The image is built for the [`linux/amd64` platform](https://unix.stackexchange.com/questions/53415/why-are-64-bit-distros-often-called-amd64)
- The image has a [valid `ENTRYPOINT`](#entrypoint)

```python
import modal

sklearn_image = modal.Image.from_registry("huanjason/scikit-learn")

@app.function(image=sklearn_image)
def fit_knn():
    from sklearn.neighbors import KNeighborsClassifier
    ...
```

If an existing image does not have either `python` or `pip` set up properly, you can still use it. Just provide a version number as the `add_python` argument to install a reproducible [standalone build](https://github.com/indygreg/python-build-standalone) of Python:

```python
import modal

image1 = modal.Image.from_registry("ubuntu:22.04", add_python="3.11")
image2 = modal.Image.from_registry("gisops/valhalla:latest", add_python="3.11")
```

The `from_registry` method can load images from all public registries, such as [Nvidia's `nvcr.io`](https://catalog.ngc.nvidia.com/containers), [AWS ECR](https://aws.amazon.com/ecr/), and [GitHub's `ghcr.io`](https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry).

We also support access to [private AWS ECR and GCP Artifact Registry images](https://modal.com/docs/guide/private-registries).

## Bring your own image definition with `.from_dockerfile`

Sometimes, you might already have a container image defined in a Dockerfile.
You can define an Image with a Dockerfile using [`Image.from_dockerfile`](https://modal.com/docs/reference/modal.Image#from_dockerfile). It takes a path to an existing Dockerfile. For instance, we might write a Dockerfile that adds scikit-learn to the official Python image: ``` FROM python:3.9 RUN pip install sklearn ``` and then define a Modal Image with it: ```python import modal dockerfile_image = modal.Image.from_dockerfile("Dockerfile") @app.function(image=dockerfile_image) def fit(): import sklearn ... ``` Note that you can still do method chaining to extend this image! ### Dockerfile command compatibility Since Modal doesn't use Docker to build containers, we have our own implementation of the [Dockerfile specification](https://docs.docker.com/engine/reference/builder/). Most Dockerfiles should work out of the box, but there are some differences to be aware of. First, a few minor Dockerfile commands and flags have not been implemented yet. These include `ONBUILD`, `STOPSIGNAL`, and `VOLUME`. Please reach out to us if your use case requires any of these. Next, there are some command-specific things that may be useful when porting a Dockerfile to Modal. #### `ENTRYPOINT` While the [`ENTRYPOINT`](https://docs.docker.com/engine/reference/builder/#entrypoint) command is supported, there is an additional constraint to the entrypoint script provided: when used with a Modal Function, it must also `exec` the arguments passed to it at some point. This is so the Modal Function runtime's Python entrypoint can run after your own. Most entrypoint scripts in Docker containers are wrappers over other scripts, so this is likely already the case. If you wish to write your own entrypoint script, you can use the following as a template: ```bash #!/usr/bin/env bash # Your custom startup commands here. exec "$@" # Runs the command passed to the entrypoint script. ``` If the above file is saved as `/usr/bin/my_entrypoint.sh` in your container, then you can register it as an entrypoint with `ENTRYPOINT ["/usr/bin/my_entrypoint.sh"]` in your Dockerfile, or with [`entrypoint`](https://modal.com/docs/reference/modal.Image#entrypoint) as an Image build step. ```python import modal image = ( modal.Image.debian_slim() .pip_install("foo") .entrypoint(["/usr/bin/my_entrypoint.sh"]) ) ``` #### `ENV` We currently don't support default values in [interpolations](https://docs.docker.com/compose/compose-file/12-interpolation/), such as `${VAR:-default}` ## Image caching and rebuilds Modal uses the definition of an Image to determine whether it needs to be rebuilt. If the definition hasn't changed since the last time you ran or deployed your App, the previous version will be pulled from the cache. Images are cached per layer (i.e., per `Image` method call), and breaking the cache on a single layer will cause cascading rebuilds for all subsequent layers. You can shorten iteration cycles by defining frequently-changing layers last so that the cached version of all other layers can be used. In some cases, you may want to force an Image to rebuild, even if the definition hasn't changed. You can do this by adding the `force_build=True` argument to any of the Image building methods. ```python import modal image = ( modal.Image.debian_slim() .apt_install("git") .pip_install("slack-sdk", force_build=True) .run_commands("echo hi") ) ``` As in other cases where a layer's definition changes, both the `pip_install` and `run_commands` layers will rebuild, but the `apt_install` will not. 
Remember to remove `force_build=True` after you've rebuilt the Image, or it will rebuild every time you run your code. Alternatively, you can set the `MODAL_FORCE_BUILD` environment variable (e.g. `MODAL_FORCE_BUILD=1 modal run ...`) to rebuild all images attached to your App. But note that when you rebuild a base layer, the cache will be invalidated for _all_ Images that depend on it, and they will rebuild the next time you run or deploy any App that uses that base. If you're debugging an issue with your Image, a better option might be using `MODAL_IGNORE_CACHE=1`. This will rebuild the Image from the top without breaking the Image cache or affecting subsequent builds. ## Image builder updates Because changes to base images will cause cascading rebuilds, Modal is conservative about updating the base definitions that we provide. But many things are baked into these definitions, like the specific versions of the Image OS, the included Python, and the Modal client dependencies. We provide a separate mechanism for keeping base images up-to-date without causing unpredictable rebuilds: the "Image Builder Version". This is a workspace level-configuration that will be used for every Image built in your workspace. We release a new Image Builder Version every few months but allow you to update your workspace's configuration when convenient. After updating, your next deployment will take longer, because your Images will rebuild. You may also encounter problems, especially if your Image definition does not pin the version of the third-party libraries that it installs (as your new Image will get the latest version of these libraries, which may contain breaking changes). You can set the Image Builder Version for your workspace by going to your [workspace settings](https://modal.com/settings/image-config). This page also documents the important updates in each version. #### Private registries # Private registries Modal provides the [`Image.from_registry`](https://modal.com/docs/guide/images#use-an-existing-container-image-with-from_registry) function, which can pull public images available from registries such as Docker Hub and GitHub Container Registry, as well as private images from registries such as [AWS Elastic Container Registry (ECR)](https://aws.amazon.com/ecr/), [GCP Artifact Registry](https://cloud.google.com/artifact-registry), and Docker Hub. ## Docker Hub (Private) To pull container images from private Docker Hub repositories, [create an access token](https://docs.docker.com/security/for-developers/access-tokens/) with "Read-Only" permissions and use this token value and your Docker Hub username to create a Modal [Secret](https://modal.com/docs/guide/secrets). ``` REGISTRY_USERNAME=my-dockerhub-username REGISTRY_PASSWORD=dckr_pat_TS012345aaa67890bbbb1234ccc ``` Use this Secret with the [`modal.Image.from_registry`](https://modal.com/docs/reference/modal.Image#from_registry) method. ## Elastic Container Registry (ECR) You can pull images from your AWS ECR account by specifying the full image URI as follows: ```python import modal aws_secret = modal.Secret.from_name("my-aws-secret") image = ( modal.Image.from_aws_ecr( "000000000000.dkr.ecr.us-east-1.amazonaws.com/my-private-registry:latest", secret=aws_secret, ) .pip_install("torch", "huggingface") ) app = modal.App(image=image) ``` As shown above, you also need to use a [Modal Secret](https://modal.com/docs/guide/secrets) containing the environment variables `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_REGION`. 
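For quick local experimentation, such a Secret can also be assembled from environment variables in code (a hedged sketch; in most cases you would instead create the Secret in the Modal dashboard using the AWS integration option described below):

```python
import os

import modal

# Sketch only: builds the Secret from credentials already present in your local environment.
aws_secret = modal.Secret.from_dict(
    {
        "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
        "AWS_REGION": "us-east-1",  # assumption: the region hosting your ECR registry
    }
)
```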
The AWS IAM user account associated with those keys must have access to the private registry you want to pull from. The user needs to have the following read-only policies:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["ecr:GetAuthorizationToken"],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetRepositoryPolicy",
        "ecr:DescribeRepositories",
        "ecr:ListImages",
        "ecr:DescribeImages",
        "ecr:BatchGetImage",
        "ecr:GetLifecyclePolicy",
        "ecr:GetLifecyclePolicyPreview",
        "ecr:ListTagsForResource",
        "ecr:DescribeImageScanFindings"
      ],
      "Resource": ""
    }
  ]
}
```

You can use the IAM configuration above as a template for creating an IAM user. You can then [generate an access key](https://aws.amazon.com/premiumsupport/knowledge-center/create-access-key/) and create a Modal Secret using the AWS integration option.

Modal will use your access keys to generate an ephemeral ECR token. That token is only used to pull image layers at the time a new image is built. We don't store this token but will cache the image once it has been pulled.

Images on ECR must be private and follow [image configuration requirements](https://modal.com/docs/reference/modal.Image#from_aws_ecr).

## Google Artifact Registry and Google Container Registry

For further detail on how to pull images from Google's image registries, see [`modal.Image.from_gcp_artifact_registry`](https://modal.com/docs/reference/modal.Image#from_gcp_artifact_registry).

#### Fast pull from registry

# Fast pull from registry

The performance of pulling public and private images from registries into Modal can be significantly improved by adopting the [eStargz](https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md) compression format. By applying eStargz compression during your image build and push, Modal will be much more efficient at pulling down your image from the registry.

## How to use estargz

If you have a [BuildKit](https://docs.docker.com/build/buildkit/) version greater than `0.10.0`, adopting `estargz` is as simple as adding some flags to your `docker buildx build` command:

- `type=registry` instructs BuildKit to push the image after building. If you do not push the image immediately after the build and instead attempt to push it later with `docker push`, the image will be converted to a standard gzip image.
- `compression=estargz` specifies that we are using the [eStargz](https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md) compression format.
- `oci-mediatypes=true` specifies that we are using the OCI media types, which is required for eStargz.
- `force-compression=true` will recompress the entire image and convert the base image to eStargz if it is not already.

```bash
docker buildx build --tag "//:" \
  --output type=registry,compression=estargz,force-compression=true,oci-mediatypes=true \
  .
```

Then reference the container image as normal in your Modal code.

```python notest
app = modal.App(
    "example-estargz-pull",
    image=modal.Image.from_registry(
        "public.ecr.aws/modal/estargz-example-images:text-generation-v1-esgz"
    )
)
```

At build time you should see the eStargz-enabled puller activate:

```
Building image im-TinABCTIf12345ydEwTXYZ

=> Step 0: FROM public.ecr.aws/modal/estargz-example-images:text-generation-v1-esgz
Using estargz to speed up image pull (index loaded in 1.86s)...
Progress: 10% complete... (1.11s elapsed)
Progress: 20% complete... (3.10s elapsed)
Progress: 30% complete...
(4.18s elapsed)
Progress: 40% complete... (4.76s elapsed)
Progress: 50% complete... (5.51s elapsed)
Progress: 62% complete... (6.17s elapsed)
Progress: 74% complete... (6.99s elapsed)
Progress: 81% complete... (7.23s elapsed)
Progress: 99% complete... (8.90s elapsed)
Progress: 100% complete... (8.90s elapsed)
Copying image...
Copied image in 5.81s
```

## Supported registries

Currently, Modal supports fast eStargz image pulls from the following registries:

- AWS Elastic Container Registry (ECR)
- Docker Hub (docker.io)
- Google Artifact Registry (gcr.io, pkg.dev)

We are working on adding support for GitHub Container Registry (ghcr.io).

### GPUs and other resources

#### GPU acceleration

# GPU acceleration

Modal makes it easy to run any code on GPUs.

## Quickstart

Here's a simple example of a function running on an A100 in Modal:

```python
import modal

app = modal.App()
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="A100", image=image)
def run():
    import torch
    print(torch.cuda.is_available())
```

This installs PyTorch on top of a base image and is able to use GPUs with PyTorch.

## Specifying GPU type

You can pick a specific GPU type for your function via the `gpu` argument. Modal supports the following values for this parameter:

- `T4`
- `L4`
- `A10G`
- `A100-40GB`
- `A100-80GB`
- `L40S`
- `H100`
- `H200`
- `B200`

For instance, to use an H100, you can use `@app.function(gpu="H100")`.

Refer to our [pricing page](https://modal.com/pricing) for the latest pricing on each GPU type.

## Specifying GPU count

You can specify more than one GPU per container by appending `:n` to the GPU argument. For instance, to run a function with 8 H100s:

```python
@app.function(gpu="H100:8")
def run_llama_405b_fp8():
    ...
```

Currently B200, H200, H100, A100, L4, T4 and L40S instances support up to 8 GPUs (up to 640 GB GPU RAM), and A10G instances support up to 4 GPUs (up to 96 GB GPU RAM). Note that requesting more than 2 GPUs per container will usually result in larger wait times. These GPUs are always attached to the same physical machine.

## Picking a GPU

For running, rather than training, neural networks, we recommend starting off with the [L40S](https://resources.nvidia.com/en-us-l40s/l40s-datasheet-28413), which offers an excellent trade-off of cost and performance and 48 GB of GPU RAM for storing model weights.

For more on how to pick a GPU for use with neural networks like LLaMA or Stable Diffusion, and for tips on how to make that GPU go brrr, check out [Tim Dettmers' blog post](https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/) or the [Full Stack Deep Learning page on Cloud GPUs](https://fullstackdeeplearning.com/cloud-gpus/).

## GPU fallbacks

Modal allows specifying a list of possible GPU types, suitable for functions that are compatible with multiple options. Modal respects the ordering of this list and will try to allocate the most preferred GPU type before falling back to less preferred ones.

```python
@app.function(gpu=["H100", "A100-40GB:2"])
def run_on_80gb():
    ...
```

See [this example](https://modal.com/docs/examples/gpu_fallbacks) for more detail.

## H100 GPUs

Modal's fastest GPUs are the [H100s](https://www.nvidia.com/en-us/data-center/h100/), NVIDIA's flagship data center chip for the Hopper/Lovelace [architecture](https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor-architecture).

To request an H100, set the `gpu` argument to `"H100"`:

```python
@app.function(gpu="H100")
def run_text_to_video():
    ...
```

Check out [this example](https://modal.com/docs/examples/flux) to see how you can generate images from the Flux.schnell model in under a second using an H100.

Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are in your computations. For example, running language models with small batch sizes (e.g. one prompt at a time) results in a [bottleneck on memory, not arithmetic](https://kipp.ly/transformer-inference-arithmetic/). Since arithmetic throughput has risen faster than memory throughput in recent hardware generations, speedups for memory-bound GPU jobs are not as extreme and may not be worth the extra cost.

**H200 GPUs**

Modal may automatically upgrade an H100 request to an [H200](https://www.nvidia.com/en-us/data-center/h200/), NVIDIA's evolution of the H100 chip for the Hopper/Lovelace [architecture](https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor-architecture). This automatic upgrade _does not_ change the cost of the GPU.

H200s are software compatible with H100s, so your code always works for both, but an upgrade to an H200 brings higher memory bandwidth! NVIDIA H200's HBM3e memory bandwidth of 4.8 TB/s is 1.4x faster than NVIDIA H100 with HBM3.

In cases where an automatic upgrade to H200 would not be desired (e.g., benchmarking), you can pass `gpu="H100!"` to avoid it.

## A100 GPUs

[A100s](https://www.nvidia.com/en-us/data-center/a100/) are the previous generation of top-of-the-line data center chip from NVIDIA, based on the Ampere [architecture](https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor-architecture). Modal offers two versions of the A100: one with 40 GB of RAM and another with 80 GB of RAM.

To request an A100 with 40 GB of [GPU memory](https://modal.com/gpu-glossary/device-hardware/gpu-ram), use `gpu="A100"`:

```python
@app.function(gpu="A100")
def llama_7b():
    ...
```

To request an 80 GB A100, use the string `A100-80GB`:

```python
@app.function(gpu="A100-80GB")
def llama_70b_fp8():
    ...
```

## Multi-GPU training

Modal currently supports multi-GPU training on a single machine, with multi-node training in closed beta ([contact us](https://modal.com/slack) for access).

Depending on which framework you are using, you may need to use different techniques to train on multiple GPUs. If the framework re-executes the entrypoint of the Python process (like [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/index.html)), you need to set the strategy to either `ddp_spawn` or `ddp_notebook` if you wish to invoke the training directly. Another option is to run the training script as a subprocess instead.

```python
@app.function(gpu="A100:2")
def run():
    import subprocess
    import sys

    subprocess.run(
        ["python", "train.py"],
        stdout=sys.stdout,
        stderr=sys.stderr,
        check=True,
    )
```

## Examples and more resources

For more information about GPUs in general, check out our [GPU Glossary](https://modal.com/gpu-glossary/readme).

Or take a look at some examples of Modal apps using GPUs:

- [Fine-tune a character LoRA for your pet](https://modal.com/docs/examples/dreambooth_app)
- [Fast LLM inference with vLLM](https://modal.com/docs/examples/vllm_inference)
- [Stable Diffusion with a CLI, API, and web UI](https://modal.com/docs/examples/stable_diffusion_cli)
- [Rendering Blender videos](https://modal.com/docs/examples/blender_video)

#### Using CUDA on Modal

# Using CUDA on Modal

Modal makes it easy to accelerate your workloads with datacenter-grade NVIDIA GPUs.
To take advantage of the hardware, you need to use matching software: the CUDA stack. This guide explains the components of that stack and how to install them on Modal. For more on which GPUs are available on Modal and how to choose a GPU for your use case, see [this guide](https://modal.com/docs/guide/gpu). For a deep dive on both the [GPU hardware](https://modal.com/gpu-glossary/device-hardware) and [software](https://modal.com/gpu-glossary/device-software) and for even more detail on [the CUDA stack](https://modal.com/gpu-glossary/host-software/), see our [GPU Glossary](https://modal.com/gpu-glossary/readme). Here's the tl;dr: - The [NVIDIA Accelerated Graphics Driver for Linux-x86_64](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#driver-installation), version 575.57.08, and [CUDA Driver API](https://docs.nvidia.com/cuda/archive/12.9.0/cuda-driver-api/index.html), version 12.8, are already installed. You can call `nvidia-smi` or run compiled CUDA programs from any Modal Function with access to a GPU. - That means you can install many popular libraries like `torch` that bundle their other CUDA dependencies [with a simple `pip_install`](#install-gpu-accelerated-torch-and-transformers-with-pip_install). - For bleeding-edge libraries like `flash-attn`, you may need to install CUDA dependencies manually. To make your life easier, [use an existing image](#for-more-complex-setups-use-an-officially-supported-cuda-image). ## What is CUDA? When someone refers to "installing CUDA" or "using CUDA", they are referring not to a library, but to a [stack](https://modal.com/gpu-glossary/host-software/cuda-software-platform) with multiple layers. Your application code (and its dependencies) can interact with the stack at different levels. ![The CUDA stack](../../assets/docs/cuda-stack-diagram.png) This leads to a lot of confusion. To help clear that up, the following sections explain each component in detail. ### Level 0: Kernel-mode driver components At the lowest level are the [_kernel-mode driver components_](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#nvidia-open-gpu-kernel-modules). The Linux kernel is essentially a single program operating the entire machine and all of its hardware. To add hardware to the machine, this program is extended by loading new modules into it. These components communicate directly with hardware -- in this case the GPU. Because they are kernel modules, these driver components are tightly integrated with the host operating system that runs your containerized Modal Functions and are not something you can inspect or change yourself. ### Level 1: User-mode driver API All action in Linux that doesn't occur in the kernel occurs in [user space](https://en.wikipedia.org/wiki/User_space). To talk to the kernel drivers from our user space programs, we need _user-mode driver components_. Most prominently, that includes: - the [CUDA Driver API](https://modal.com/gpu-glossary/host-software/cuda-driver-api), a [shared object](https://en.wikipedia.org/wiki/Shared_library) called `libcuda.so`. This object exposes functions like [`cuMemAlloc`](https://docs.nvidia.com/cuda/archive/12.8.0/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gb82d2a09844a58dd9e744dc31e8aa467), for allocating GPU memory. - the [NVIDIA management library](https://developer.nvidia.com/management-library-nvml), `libnvidia-ml.so`, and its command line interface [`nvidia-smi`](https://developer.nvidia.com/system-management-interface). 
You can use these tools to check the status of the system's GPU(s). These components are installed on all Modal machines with access to GPUs. Because they are user-level components, you can use them directly: ```python runner:ModalRunner import modal app = modal.App() @app.function(gpu="any") def check_nvidia_smi(): import subprocess output = subprocess.check_output(["nvidia-smi"], text=True) assert "Driver Version:" in output assert "CUDA Version:" in output print(output) return output ``` ### Level 2: CUDA Toolkit Wrapping the CUDA Driver API is the [CUDA Runtime API](https://modal.com/gpu-glossary/host-software/cuda-runtime-api), the `libcudart.so` shared library. This API includes functions like [`cudaLaunchKernel`](https://docs.nvidia.com/cuda/archive/12.8.0/cuda-runtime-api/group__CUDART__HIGHLEVEL.html#group__CUDART__HIGHLEVEL_1g7656391f2e52f569214adbfc19689eb3) and is more commonly used in CUDA programs (see [this HackerNews comment](https://news.ycombinator.com/item?id=20616385) for color commentary on why). This shared library is _not_ installed by default on Modal. The CUDA Runtime API is generally installed as part of the larger [NVIDIA CUDA Toolkit](https://docs.nvidia.com/cuda/index.html), which includes the [NVIDIA CUDA compiler driver](https://modal.com/gpu-glossary/host-software/nvcc) (`nvcc`) and its toolchain and a number of [useful goodies](https://modal.com/gpu-glossary/host-software/cuda-binary-utilities) for writing and debugging CUDA programs (`cuobjdump`, `cudnn`, profilers, etc.). Contemporary GPU-accelerated machine learning workloads like LLM inference frequently make use of many components of the CUDA Toolkit, such as the run-time compilation library [`nvrtc`](https://docs.nvidia.com/cuda/archive/12.8.0/nvrtc/index.html). So why aren't these components installed along with the drivers? A compiled CUDA program can run without the CUDA Runtime API installed on the system, by [statically linking](https://en.wikipedia.org/wiki/Static_library) the CUDA Runtime API into the program binary, though this is fairly uncommon for CUDA-accelerated Python programs. Additionally, older versions of these components are needed for some applications and some application deployments even use several versions at once. Both patterns are compatible with the host machine driver provided on Modal. ## Install GPU-accelerated `torch` and `transformers` with `pip_install` The components of the CUDA Toolkit can be installed via `pip`, via PyPI packages like [`nvidia-cuda-runtime-cu12`](https://pypi.org/project/nvidia-cuda-runtime-cu12/) and [`nvidia-cuda-nvrtc-cu12`](https://pypi.org/project/nvidia-cuda-nvrtc-cu12/). These components are listed as dependencies of some popular GPU-accelerated Python libraries, like `torch`. 
Because Modal already includes the lower parts of the CUDA stack, you can install these libraries with [the `pip_install` method of `modal.Image`](https://modal.com/docs/guide/images#add-python-packages-with-pip_install), just like any other Python library: ```python image = modal.Image.debian_slim().pip_install("torch") @app.function(gpu="any", image=image) def run_torch(): import torch has_cuda = torch.cuda.is_available() print(f"It is {has_cuda} that torch can access CUDA") return has_cuda ``` Many libraries for running open-weights models, like `transformers` and `vllm`, use `torch` under the hood and so can be installed in the same way: ```python image = modal.Image.debian_slim().pip_install("transformers[torch]") image = image.apt_install("ffmpeg") # for audio processing @app.function(gpu="any", image=image) def run_transformers(): from transformers import pipeline transcriber = pipeline(model="openai/whisper-tiny.en", device="cuda") result = transcriber("https://modal-cdn.com/mlk.flac") print(result["text"]) # I have a dream that one day this nation will rise up live out the true meaning of its creed ``` ## For more complex setups, use an officially-supported CUDA image The disadvantage of installing the CUDA stack via `pip` is that many other libraries that depend on its components being installed as normal system packages cannot find them. For these cases, we recommend you use an image that already has the full CUDA stack installed as system packages and all environment variables set correctly, like the [`nvidia/cuda:*-devel-*` images on Docker Hub](https://hub.docker.com/r/nvidia/cuda). [TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/overview.html) is an inference engine that accelerates and optimizes performance for the large language models. It requires the full CUDA toolkit for installation. ```python cuda_version = "12.8.1" # should be no greater than host CUDA version flavor = "devel" # includes full CUDA toolkit operating_sys = "ubuntu24.04" tag = f"{cuda_version}-{flavor}-{operating_sys}" HF_CACHE_PATH = "/cache" image = ( modal.Image.from_registry(f"nvidia/cuda:{tag}", add_python="3.12") .entrypoint([]) # remove verbose logging by base image on entry .apt_install("libopenmpi-dev") # required for tensorrt .pip_install("tensorrt-llm==0.19.0", "pynvml", extra_index_url="https://pypi.nvidia.com") .pip_install("hf-transfer", "huggingface_hub[hf_xet]") .env({"HF_HUB_CACHE": HF_CACHE_PATH, "HF_HUB_ENABLE_HF_TRANSFER": "1", "PMIX_MCA_gds": "hash"}) ) app = modal.App("tensorrt-llm", image=image) hf_cache_volume = modal.Volume.from_name("hf_cache_tensorrt", create_if_missing=True) @app.function(gpu="A10G", volumes={HF_CACHE_PATH: hf_cache_volume}) def run_tiny_model(): from tensorrt_llm import LLM, SamplingParams sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0") output = llm.generate("The capital of France is", sampling_params) print(f"Generated text: {output.outputs[0].text}") return output.outputs[0].text ``` Make sure to choose a version of CUDA that is no greater than the version provided by the host machine. Older minor (`12.*`) versions are guaranteed to be compatible with the host machine's driver, but older major (`11.*`, `10.*`, etc.) versions may not be. ## What next? For more on accessing and choosing GPUs on Modal, check out [this guide](https://modal.com/docs/guide/gpu). To dive deep on GPU internals, check out our [GPU Glossary](https://modal.com/gpu-glossary/readme). 
To see these installation patterns in action, check out these examples: - [Fast LLM inference with vLLM](https://modal.com/docs/examples/vllm_inference) - [Finetune a character LoRA for your pet](https://modal.com/docs/examples/diffusers_lora_finetune) - [Optimized Flux inference](https://modal.com/docs/examples/flux) #### Reserving CPU and memory # Reserving CPU and memory Each Modal container has a default reservation of 0.125 CPU cores and 128 MiB of memory. Containers can exceed this minimum if the worker has available CPU or memory. You can also guarantee access to more resources by requesting a higher reservation. ## CPU cores If you have code that must run on a larger number of cores, you can request that using the `cpu` argument. This allows you to specify a floating-point number of CPU cores: ```python import modal app = modal.App() @app.function(cpu=8.0) def my_function(): # code here will have access to at least 8.0 cores ... ``` ## Memory If you have code that needs more guaranteed memory, you can request it using the `memory` argument. This expects an integer number of megabytes: ```python import modal app = modal.App() @app.function(memory=32768) def my_function(): # code here will have access to at least 32 GiB of RAM ... ``` ## How much can I request? For both CPU and memory, a maximum is enforced at function creation time to ensure your application can be scheduled for execution. Requests exceeding the maximum will be rejected with an [`InvalidError`](https://modal.com/docs/reference/modal.exception#modalexceptioninvaliderror). As the platform grows, we plan to support larger CPU and memory reservations. ## Billing For CPU and memory, you'll be charged based on whichever is higher: your reservation or actual usage. Disk requests are billed by increasing the memory request at a 20:1 ratio. For example, requesting 500 GiB of disk will increase the memory request to 25 GiB, if it is not already set higher. ## Resource limits ### CPU limits Modal containers have a default soft CPU limit that is set at 16 physical cores above the CPU request. Given that the default CPU request is 0.125 cores the default soft CPU limit is 16.125 cores. Above this limit the host will begin to throttle the CPU usage of the container. You can alternatively set the CPU limit explicitly. ```python cpu_request = 1.0 cpu_limit = 4.0 @app.function(cpu=(cpu_request, cpu_limit)) def f(): ... ``` ### Memory limits Modal containers can have a hard memory limit which will 'Out of Memory' (OOM) kill containers which attempt to exceed the limit. This functionality is useful when a container has a serious memory leak. You can set the limit and have the container killed to avoid paying for the leaked GBs of memory. ```python mem_request = 1024 mem_limit = 2048 @app.function( memory=(mem_request, mem_limit), ) def f(): ... ``` Specify this limit using the [`memory` parameter](https://modal.com/docs/reference/modal.App#function) on Modal Functions. ### Disk limits Running Modal containers have access to many GBs of SSD disk, but the amount of writes is limited by: 1. The size of the underlying worker's SSD disk capacity 2. A per-container disk quota that is set in the 100s of GBs. Hitting either limit will cause the container's disk writes to be rejected, which typically manifests as an `OSError`. Increased disk sizes can be requested with the [`ephemeral_disk` parameter](https://modal.com/docs/reference/modal.App#function). The maximum disk size is 3.0 TiB (3,145,728 MiB). 
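For example, a Function that needs extra scratch space for large intermediate files might request a bigger disk like this (a minimal sketch; `ephemeral_disk` is specified in MiB, and the 1 TiB figure and function name are illustrative):

```python
import modal

app = modal.App()

@app.function(ephemeral_disk=1_048_576)  # ~1 TiB of scratch disk, in MiB (illustrative)
def shuffle_large_dataset():
    # read, transform, and write large intermediate files on the container's local disk
    ...
```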
Larger disks are intended to be used for [dataset processing](https://modal.com/docs/guide/dataset-ingestion).

### Scaling out

#### Scaling out

# Scaling out

Modal makes it trivially easy to scale compute across thousands of containers. You won't have to worry about your App crashing if it goes viral or need to wait a long time for your batch jobs to complete.

For the most part, scaling out will happen automatically, and you won't need to think about it. But it can be helpful to understand how Modal's autoscaler works and how you can adjust its behavior when you need finer control.

## How does autoscaling work on Modal?

Every Modal Function corresponds to an autoscaling pool of containers. The size of the pool is managed by Modal's autoscaler. The autoscaler will spin up new containers when there is no capacity available for new inputs, and it will spin down containers when resources are idling. By default, Modal Functions will scale to zero when there are no inputs to process.

Autoscaling decisions are made quickly and frequently so that your batch jobs can ramp up fast and your deployed Apps can respond to any sudden changes in traffic.

## Configuring autoscaling behavior

Modal exposes a few settings that allow you to configure the autoscaler's behavior. These settings can be passed to the `@app.function` or `@app.cls` decorators:

- `max_containers`: The upper limit on containers for the specific Function.
- `min_containers`: The minimum number of containers that should be kept warm, even when the Function is inactive.
- `buffer_containers`: The size of the buffer to maintain while the Function is active, so that additional inputs will not need to queue for a new container.
- `scaledown_window`: The maximum duration (in seconds) that individual containers can remain idle when scaling down.

In general, these settings allow you to trade off cost and latency. Maintaining a larger warm pool or idle buffer will increase costs but reduce the chance that inputs will need to wait for a new container to start. Similarly, a longer scaledown window will let containers idle for longer, which might help avoid unnecessary churn for Apps that receive regular but infrequent inputs. Note that containers may not wait for the entire scaledown window before shutting down if the App is substantially overprovisioned.

## Dynamic autoscaler updates

It's also possible to update the autoscaler settings dynamically (i.e., without redeploying the App) using the [`Function.update_autoscaler()`](https://modal.com/docs/reference/modal.Function#update_autoscaler) method:

```python notest
f = modal.Function.from_name("my-app", "f")
f.update_autoscaler(max_containers=100)
```

The autoscaler settings will revert to the configuration in the function decorator the next time you deploy the App. Or they can be overridden by further dynamic updates:

```python notest
f.update_autoscaler(min_containers=2, max_containers=10)
f.update_autoscaler(min_containers=4)  # max_containers=10 will still be in effect
```

A common pattern is to run this method in a [scheduled function](https://modal.com/docs/guide/cron) that adjusts the size of the warm pool (or container buffer) based on the time of day:

```python
@app.function()
def inference_server():
    ...
@app.function(schedule=modal.Cron("0 6 * * *", timezone="America/New_York"))
def increase_warm_pool():
    inference_server.update_autoscaler(min_containers=4)

@app.function(schedule=modal.Cron("0 22 * * *", timezone="America/New_York"))
def decrease_warm_pool():
    inference_server.update_autoscaler(min_containers=0)
```

When you have a [`modal.Cls`](https://modal.com/docs/reference/modal.Cls), `update_autoscaler` is a method on an _instance_ and will control the autoscaling behavior of containers serving the Function with that specific set of parameters:

```python notest
MyClass = modal.Cls.from_name("my-app", "MyClass")
obj = MyClass(model_version="3.5")
obj.update_autoscaler(buffer_containers=2)  # type: ignore
```

Note that it's necessary to disable type checking on this line, because the object will appear as an instance of the class that you defined rather than the Modal wrapper type.

## Parallel execution of inputs

If your code is running the same function repeatedly with different independent inputs (e.g., a grid search), the easiest way to increase performance is to run those function calls in parallel using Modal's [`Function.map()`](https://modal.com/docs/reference/modal.Function#map) method.

Here is an example where we have a function `evaluate_model` that takes a single argument:

```python
import modal

app = modal.App()

@app.function()
def evaluate_model(x):
    ...

@app.local_entrypoint()
def main():
    inputs = list(range(100))
    for result in evaluate_model.map(inputs):  # runs many inputs in parallel
        ...
```

In this example, `evaluate_model` will be called with each of the 100 inputs (the numbers 0 through 99 in this case) roughly in parallel, and the results are returned as an iterable, ordered in the same way as the inputs.

### Exceptions

By default, if any of the function calls raises an exception, the exception will be propagated. To treat exceptions as successful results and aggregate them in the results list, pass in [`return_exceptions=True`](https://modal.com/docs/reference/modal.Function#map).

```python
@app.function()
def my_func(a):
    if a == 2:
        raise Exception("ohno")
    return a ** 2

@app.local_entrypoint()
def main():
    print(list(my_func.map(range(3), return_exceptions=True, wrap_returned_exceptions=False)))
    # [0, 1, Exception('ohno')]
```

Note: prior to version 1.0.5, the returned exceptions inadvertently leaked an internal wrapper type (`modal.exceptions.UserCodeException`). To avoid breaking any user code that was checking exception types, we're taking a gradual approach to fixing this bug. Passing `wrap_returned_exceptions=False` will opt in to the future default behavior and return the underlying exception type without a wrapper.

### Starmap

If your function takes multiple variable arguments, you can either use [`Function.map()`](https://modal.com/docs/reference/modal.Function#map) with one input iterator per argument, or [`Function.starmap()`](https://modal.com/docs/reference/modal.Function#starmap) with a single input iterator containing sequences (like tuples) that can be spread over the arguments. This works similarly to Python's built-in `map` and `itertools.starmap`.

```python
@app.function()
def my_func(a, b):
    return a + b

@app.local_entrypoint()
def main():
    assert list(my_func.starmap([(1, 2), (3, 4)])) == [3, 7]
```

### Gotchas

Note that `.map()` is a method on the modal function object itself, so you don't explicitly _call_ the function.
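For contrast with the incorrect patterns below, the correct call looks like the earlier example:

```python notest
results = list(evaluate_model.map(inputs))  # .map() is called on the Modal Function itself
```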
Incorrect usage: ```python notest results = evaluate_model(inputs).map() ``` Modal's map is also not the same as using Python's builtin `map()`. While the following will technically work, it will execute all inputs in sequence rather than in parallel. Incorrect usage: ```python notest results = map(evaluate_model, inputs) ``` ## Asynchronous usage All Modal APIs are available in both blocking and asynchronous variants. If you are comfortable with asynchronous programming, you can use it to create arbitrary parallel execution patterns, with the added benefit that any Modal functions will be executed remotely. See the [async guide](https://modal.com/docs/guide/async) or the examples for more information about asynchronous usage. ## GPU acceleration Sometimes you can speed up your applications by utilizing GPU acceleration. See the [gpu section](https://modal.com/docs/guide/gpu) for more information. ## Scaling Limits Modal enforces the following limits for every function: - 2,000 pending inputs (inputs that haven't been assigned to a container yet) - 25,000 total inputs (which include both running and pending inputs) For inputs created with `.spawn()` for async jobs, Modal allows up to 1 million pending inputs instead of 2,000. If you try to create more inputs and exceed these limits, you'll receive a `Resource Exhausted` error, and you should retry your request later. If you need higher limits, please reach out! Additionally, each `.map()` invocation can process at most 1000 inputs concurrently. #### Input concurrency # Input concurrency As traffic to your application increases, Modal will automatically scale up the number of containers running your Function:
By default, each container will be assigned one input at a time. Autoscaling across containers allows your Function to process inputs in parallel. This is ideal when the operations performed by your Function are CPU-bound. For some workloads, though, it is inefficient for containers to process inputs one-by-one. Modal supports these workloads with its _input concurrency_ feature, which allows individual containers to process multiple inputs at the same time:
When used effectively, input concurrency can reduce latency and lower costs. ## Use cases Input concurrency can be especially effective for workloads that are primarily I/O-bound, e.g.: - Querying a database - Making external API requests - Making remote calls to other Modal Functions For such workloads, individual containers may be able to concurrently process large numbers of inputs with minimal additional latency. This means that your Modal application will be more efficient overall, as it won't need to scale containers up and down as traffic ebbs and flows. Another use case is to leverage _continuous batching_ on GPU-accelerated containers. Frameworks such as [vLLM](https://modal.com/docs/examples/vllm_inference) can achieve the benefits of batching across multiple inputs even when those inputs do not arrive simultaneously (because new batches are formed for each forward pass of the model). Note that for CPU-bound workloads, input concurrency will likely not be as effective (or will even be counterproductive), and you may want to use Modal's [_dynamic batching_ feature](https://modal.com/docs/guide/dynamic-batching) instead. ## Enabling input concurrency To enable input concurrency, add the `@modal.concurrent` decorator: ```python @app.function() @modal.concurrent(max_inputs=100) def my_function(input: str): ... ``` When using the class pattern, the decorator should be applied at the level of the _class_, not on individual methods: ```python @app.cls() @modal.concurrent(max_inputs=100) class MyCls: @modal.method() def my_method(self, input: str): ... ``` Because all methods on a class will be served by the same containers, a class with input concurrency enabled will concurrently run distinct methods in addition to multiple inputs for the same method. **Note:** The `@modal.concurrent` decorator was added in v0.73.148 of the Modal Python SDK. Input concurrency could previously be enabled by setting the `allow_concurrent_inputs` parameter on the `@app.function` decorator. ## Setting a concurrency target When using the `@modal.concurrent` decorator, you must always configure the maximum number of inputs that each container will concurrently process. If demand exceeds this limit, Modal will automatically scale up more containers. Additional inputs may need to queue up while these additional containers cold start. To help avoid degraded latency during scaleup, the `@modal.concurrent` decorator has a separate `target_inputs` parameter. When set, Modal's autoscaler will aim for this target as it provisions resources. If demand increases faster than new containers can spin up, the active containers will be allowed to burst above the target up to the `max_inputs` limit: ```python @app.function() @modal.concurrent(max_inputs=120, target_inputs=100) # Allow a 20% burst def my_function(input: str): ... ``` It may take some experimentation to find the right settings for these parameters in your particular application. Our suggestion is to set the `target_inputs` based on your desired latency and the `max_inputs` based on resource constraints (i.e., to avoid GPU OOM). You may also consider the relative latency cost of scaling up a new container versus overloading the existing containers. ## Concurrency mechanisms Modal uses different concurrency mechanisms to execute your Function depending on whether it is defined as synchronous or asynchronous. Each mechanism imposes certain requirements on the Function implementation. 
Input concurrency is an advanced feature, and it's important to make sure that your implementation complies with these requirements to avoid unexpected behavior. For synchronous Functions, Modal will execute concurrent inputs on separate threads. _This means that the Function implementation must be thread-safe._ ```python # Each container can execute up to 10 inputs in separate threads @app.function() @modal.concurrent(max_inputs=10) def sleep_sync(): # Function must be thread-safe time.sleep(1) ``` For asynchronous Functions, Modal will execute concurrent inputs using separate `asyncio` tasks on a single thread. This does not require thread safety, but it does mean that the Function needs to participate in collaborative multitasking (i.e., it should not block the event loop). ```python # Each container can execute up to 10 inputs with separate async tasks @app.function() @modal.concurrent(max_inputs=10) async def sleep_async(): # Function must not block the event loop await asyncio.sleep(1) ``` ## Gotchas Input concurrency is a powerful feature, but there are a few caveats that can be useful to be aware of before adopting it. ### Input cancellations Synchronous and asynchronous Functions handle input cancellations differently. Modal will raise a `modal.exception.InputCancellation` exception in synchronous Functions and an `asyncio.CancelledError` in asynchronous Functions. When using input concurrency with a synchronous Function, a single input cancellation will terminate the entire container. If your workflow depends on graceful input cancellations, we recommend using an asynchronous implementation. ### Concurrent logging The separate threads or tasks that are executing the concurrent inputs will write any logs to the same stream. This makes it difficult to associate logs with a specific input, and filtering for a specific function call in Modal's web dashboard will show logs for all inputs running at the same time. To work around this, we recommend including a unique identifier in the messages you log (either your own identifier or the `modal.current_input_id()`) so that you can use the search functionality to surface logs for a specific input: ```python @app.function() @modal.concurrent(max_inputs=10) async def better_concurrent_logging(x: int): logger.info(f"{modal.current_input_id()}: Starting work with {x}") ``` #### Batch processing # Batch Processing Modal is optimized for large-scale batch processing, allowing functions to scale to thousands of parallel containers with zero additional configuration. Function calls can be submitted asynchronously for background execution, eliminating the need to wait for jobs to finish or tune resource allocation. This guide covers Modal's batch processing capabilities, from basic invocation to integration with existing pipelines. ## Background Execution with `.spawn_map` The fastest way to submit multiple jobs for asynchronous processing is by invoking a function with `.spawn_map`. When combined with the [`--detach`](https://modal.com/docs/reference/cli/run) flag, your App continues running until all jobs are completed. Here's an example of submitting 100,000 videos for parallel embedding. 
You can disconnect after submission, and the processing will continue to completion in the background: ```python # Kick off asynchronous jobs with `modal run --detach batch_processing.py` import modal app = modal.App("batch-processing-example") volume = modal.Volume.from_name("video-embeddings", create_if_missing=True) @app.function(volumes={"/data": volume}) def embed_video(video_id: int): # Business logic: # - Load the video from the volume # - Embed the video # - Save the embedding to the volume ... @app.local_entrypoint() def main(): embed_video.spawn_map(range(100_000)) ``` This pattern works best for jobs that store results externally—for example, in a [Modal Volume](https://modal.com/docs/guide/volumes), [Cloud Bucket Mount](https://modal.com/docs/guide/cloud-bucket-mounts), or your own database\*. _\* For database connections, consider using [Modal Proxy](https://modal.com/docs/guide/proxy-ips) to maintain a static IP across thousands of containers._ ## Parallel Processing with `.map` Using `.map` allows you to offload expensive computations to powerful machines while gathering results. This is particularly useful for pipeline steps with bursty resource demands. Modal handles all infrastructure provisioning and de-provisioning automatically. Here's how to implement parallel video similarity queries as a single Modal function call: ```python # Run jobs and collect results with `modal run gather.py` import modal app = modal.App("gather-results-example") @app.function(gpu="L40S") def compute_video_similarity(query: str, video_id: int) -> tuple[int, int]: # Embed video with GPU acceleration & compute similarity with query return video_id, score @app.local_entrypoint() def main(): import itertools queries = itertools.repeat("Modal for batch processing") video_ids = range(100_000) for video_id, score in compute_video_similarity.map(queries, video_ids): # Process results (e.g., extract top 5 most similar videos) pass ``` This example runs `compute_video_similarity` on an autoscaling pool of L40S GPUs, returning scores to a local process for further processing. ## Integration with Existing Systems The recommended way to use Modal Functions within your existing data pipeline is through [deployed function invocation](https://modal.com/docs/guide/trigger-deployed-functions). After deployment, you can call Modal functions from external systems: ```python def external_function(inputs): compute_similarity = modal.Function.from_name( "gather-results-example", "compute_video_similarity" ) for result in compute_similarity.map(inputs): # Process results pass ``` You can invoke Modal Functions from any Python context, gaining access to built-in observability, resource management, and GPU acceleration. #### Job queues # Job processing Modal can be used as a scalable job queue to handle asynchronous tasks submitted from a web app or any other Python application. This allows you to offload up to 1 million long-running or resource-intensive tasks to Modal, while your main application remains responsive. ## Creating jobs with .spawn() The basic pattern for using Modal as a job queue involves three key steps: 1. Defining and deploying the job processing function using `modal deploy`. 2. Submitting a job using [`modal.Function.spawn()`](https://modal.com/docs/reference/modal.Function#spawn) 3. 
Polling for the job's result using [`modal.FunctionCall.get()`](https://modal.com/docs/reference/modal.FunctionCall#get) Here's a simple example that you can run with `modal run my_job_queue.py`: ```python # my_job_queue.py import modal app = modal.App("my-job-queue") @app.function() def process_job(data): # Perform the job processing here return {"result": data} def submit_job(data): # Since the `process_job` function is deployed, need to first look it up process_job = modal.Function.from_name("my-job-queue", "process_job") call = process_job.spawn(data) return call.object_id def get_job_result(call_id): function_call = modal.FunctionCall.from_id(call_id) try: result = function_call.get(timeout=5) except modal.exception.OutputExpiredError: result = {"result": "expired"} except TimeoutError: result = {"result": "pending"} return result @app.local_entrypoint() def main(): data = "my-data" # Submit the job to Modal call_id = submit_job(data) print(get_job_result(call_id)) ``` In this example: - `process_job` is the Modal function that performs the actual job processing. To deploy the `process_job` function on Modal, run `modal deploy my_job_queue.py`. - `submit_job` submits a new job by first looking up the deployed `process_job` function, then calling `.spawn()` with the job data. It returns the unique ID of the spawned function call. - `get_job_result` attempts to retrieve the result of a previously submitted job using [`FunctionCall.from_id()`](https://modal.com/docs/reference/modal.FunctionCall#from_id) and [`FunctionCall.get()`](https://modal.com/docs/reference/modal.FunctionCall#get). [`FunctionCall.get()`](https://modal.com/docs/reference/modal.FunctionCall#get) waits indefinitely by default. It takes an optional timeout argument that specifies the maximum number of seconds to wait, which can be set to 0 to poll for an output immediately. Here, if the job hasn't completed yet, we return a pending response. - The results of a `.spawn()` are accessible via `FunctionCall.get()` for up to 7 days after completion. After this period, we return an expired response. [Document OCR Web App](https://modal.com/docs/examples/doc_ocr_webapp) is an example that uses this pattern. ## Integration with web frameworks You can easily integrate the job queue pattern with web frameworks like FastAPI. Here's an example, assuming that you have already deployed `process_job` on Modal with `modal deploy` as above. This example won't work if you haven't deployed your app yet. ```python # my_job_queue_endpoint.py import fastapi import modal image = modal.Image.debian_slim().pip_install("fastapi[standard]") app = modal.App("fastapi-modal", image=image) web_app = fastapi.FastAPI() @app.function() @modal.asgi_app() def fastapi_app(): return web_app @web_app.post("/submit") async def submit_job_endpoint(data): process_job = modal.Function.from_name("my-job-queue", "process_job") call = process_job.spawn(data) return {"call_id": call.object_id} @web_app.get("/result/{call_id}") async def get_job_result_endpoint(call_id: str): function_call = modal.FunctionCall.from_id(call_id) try: result = function_call.get(timeout=0) except modal.exception.OutputExpiredError: return fastapi.responses.JSONResponse(content="", status_code=404) except TimeoutError: return fastapi.responses.JSONResponse(content="", status_code=202) return result ``` In this example: - The `/submit` endpoint accepts job data, submits a new job using `process_job.spawn()`, and returns the job's ID to the client. 
- The `/result/{call_id}` endpoint allows the client to poll for the job's result using the job ID. If the job hasn't completed yet, it returns a 202 status code to indicate that the job is still being processed. If the job has expired, it returns a 404 status code to indicate that the job is not found. You can try this app by serving it with `modal serve`: ```shell modal serve my_job_queue_endpoint.py ``` Then interact with its endpoints with `curl`: ```shell # Make a POST request to your app endpoint. $ curl -X POST $YOUR_APP_ENDPOINT/submit?data=data {"call_id":"fc-XXX"} # Use the call_id value from above. $ curl -X GET $YOUR_APP_ENDPOINT/result/fc-XXX ``` ## Scaling and reliability Modal automatically scales the job queue based on the workload, spinning up new instances as needed to process jobs concurrently. It also provides built-in reliability features like automatic retries and timeout handling. You can customize the behavior of the job queue by configuring the `@app.function()` decorator with options like [`retries`](https://modal.com/docs/guide/retries#function-retries), [`timeout`](https://modal.com/docs/guide/timeouts#timeouts), and [`max_containers`](https://modal.com/docs/guide/scale#configuring-autoscaling-behavior). #### Dynamic batching (beta) # Dynamic batching (beta) Modal's `@batched` feature allows you to accumulate requests and process them in dynamically-sized batches, rather than one-by-one. Batching increases throughput at a potential cost to latency. Batched requests can share resources and reuse work, reducing the time and cost per request. Batching is particularly useful for GPU-accelerated machine learning workloads, as GPUs are designed to maximize throughput and are frequently bottlenecked on shareable resources, like weights stored in memory. Static batching can lead to unbounded latency, as the function waits for a fixed number of requests to arrive. Modal's dynamic batching waits for the lesser of a fixed time _or_ a fixed number of requests before executing, maximizing the throughput benefit of batching while minimizing the latency penalty. ## Enable dynamic batching with `@batched` To enable dynamic batching, apply the [`@modal.batched` decorator](https://modal.com/docs/reference/modal.batched) to the target Python function. Then, wrap it in `@app.function()` and run it on Modal, and the inputs will be accumulated and processed in batches. Here's what that looks like: ```python import modal app = modal.App() @app.function() @modal.batched(max_batch_size=2, wait_ms=1000) async def batch_add(xs: list[int], ys: list[int]) -> list[int]: return [x + y for x, y in zip(xs, ys)] ``` When you invoke a function decorated with `@batched`, you invoke it asynchronously on individual inputs. Outputs are returned where they were invoked. For instance, the code below invokes the decorated `batch_add` function above three times, but `batch_add` only executes twice: ```python continuation @app.local_entrypoint() async def main(): inputs = [(1, 300), (2, 200), (3, 100)] async for result in batch_add.starmap.aio(inputs): print(f"Sum: {result}") # Sum: 301 # Sum: 202 # Sum: 103 ``` The first time, it is executed with `xs` batched to `[1, 2]` and `ys` batched to `[300, 200]`. After about a one-second delay, it is executed with `xs` batched to `[3]` and `ys` batched to `[100]`. The result is an iterator that yields `301`, `202`, and `103`.
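Batching is not specific to `starmap`: any single-input invocations that arrive close together are accumulated into the same batches. Here is a minimal sketch (reusing the `app` and `batch_add` definitions above; the entrypoint name `gather_main` is illustrative) that submits the same three inputs as separate `.remote.aio` calls, which are likewise grouped into two executions:

```python continuation
import asyncio


@app.local_entrypoint()
async def gather_main():
    # Three single-input calls made concurrently; Modal accumulates them
    # into batches of up to `max_batch_size=2` inputs each.
    results = await asyncio.gather(
        batch_add.remote.aio(1, 300),
        batch_add.remote.aio(2, 200),
        batch_add.remote.aio(3, 100),
    )
    print(results)  # expected: [301, 202, 103]
```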
## Use `@batched` with functions that take and return lists For a Python function to be compatible with `@modal.batched`, it must adhere to the following rules: - **The inputs to the function must be lists.** In the example above, we pass `xs` and `ys`, which are both lists of `int`s. - **The function must return a list.** In the example above, the function returns a list of sums. - **The lengths of all the input lists and the output list must be the same.** In the example above, if `L == len(xs) == len(ys)`, then `L == len(batch_add(xs, ys))`. ## Modal `Cls` methods are compatible with dynamic batching Methods on Modal [`Cls`](https://modal.com/docs/guide/lifecycle-functions)es also support dynamic batching. ```python import modal app = modal.App() @app.cls() class BatchedClass: @modal.batched(max_batch_size=2, wait_ms=1000) async def batch_add(self, xs: list[int], ys: list[int]) -> list[int]: return [x + y for x, y in zip(xs, ys)] ``` One additional rule applies to classes with Batched Methods: - If a class has a Batched Method, it **cannot have other Batched Methods or [Methods](https://modal.com/docs/reference/modal.method#modalmethod)**. ## Configure the wait time and batch size of dynamic batches The `@batched` decorator takes in two required configuration parameters: - `max_batch_size` limits the number of inputs combined into a single batch. - `wait_ms` limits the amount of time the Function waits for more inputs after the first input is received. The first invocation of the Batched Function initiates a new batch, and subsequent calls add requests to this ongoing batch. If `max_batch_size` is reached, the batch immediately executes. If the `max_batch_size` is not met but `wait_ms` has passed since the first request was added to the batch, the unfilled batch is executed. ### Selecting a batch configuration To optimize the batching configurations for your application, consider the following heuristics: - Set `max_batch_size` to the largest value your function can handle, so you can amortize and parallelize as much work as possible. - Set `wait_ms` to the difference between your targeted latency and the execution time. Most applications have a targeted latency, and this allows the latency of any request to stay within that limit. ## Serve web endpoints with dynamic batching Here's a simple example of serving a Function that batches requests dynamically with a [`@modal.fastapi_endpoint`](https://modal.com/docs/guide/webhooks). Run [`modal serve`](https://modal.com/docs/reference/cli/serve), submit requests to the endpoint, and the Function will batch your requests on the fly. ```python import modal app = modal.App(image=modal.Image.debian_slim().pip_install("fastapi")) @app.function() @modal.batched(max_batch_size=2, wait_ms=1000) async def batch_add(xs: list[int], ys: list[int]) -> list[int]: return [x + y for x, y in zip(xs, ys)] @app.function() @modal.fastapi_endpoint(method="POST", docs=True) async def add(body: dict[str, int]) -> dict[str, int]: result = await batch_add.remote.aio(body["x"], body["y"]) return {"result": result} ``` Now, you can submit requests to the web endpoint and process them in batches.
For instance, the three requests in the following example, which might be requests from concurrent clients in a real deployment, will be batched into two executions: ```python notest import asyncio import aiohttp async def send_post_request(session, url, data): async with session.post(url, json=data) as response: return await response.json() async def main(): # Enter the URL of your web endpoint here url = "https://workspace--app-name-endpoint-name.modal.run" async with aiohttp.ClientSession() as session: # Submit three requests asynchronously tasks = [ send_post_request(session, url, {"x": 1, "y": 300}), send_post_request(session, url, {"x": 2, "y": 200}), send_post_request(session, url, {"x": 3, "y": 100}), ] results = await asyncio.gather(*tasks) for result in results: print(f"Sum: {result['result']}") asyncio.run(main()) ``` #### Multi-node clusters (beta) # Multi-node clusters (beta) > 🚄 Multi-node clusters with RDMA are in **private beta.** Please contact us via the [Modal Slack](https://modal.com/slack) or support@modal.com to get access. Modal supports running a training job across several coordinated containers. Each container can saturate the available GPU devices on its host (a.k.a. node) and communicate with peer containers which do the same. By scaling a training job from a single GPU to 16 GPUs, you can achieve a nearly 16x improvement in training time. ### Cluster compute capability Modal H100 clusters provide: - A 50 Gbps [IPv6 private network](https://modal.com/docs/guide/private-networking) for orchestration and dataset downloading. - A 3200 Gbps RDMA scale-out network ([RoCE](https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet)). - Up to 64 H100 SXM devices. - At least 1TB of RAM and 4TB of local NVMe SSD per node. - Deep burn-in testing. - Interoperability with all Modal platform functionality (Volumes, Dicts, Tunnels, etc.). This guide will walk you through how the Modal client library enables multi-node training and integrates with `torchrun`. ### @clustered Unlike standard Modal serverless containers, containers in a multi-node training job must be able to: 1. Perform fast, direct network communication between each other. 2. Be scheduled together, all or nothing, at the same time. The `@clustered` decorator enables this behavior. ```python notest import modal import modal.experimental @app.function( gpu="H100:8", timeout=60 * 60 * 24, retries=modal.Retries(initial_delay=0.0, max_retries=10), ) @modal.experimental.clustered(size=4) def train_model(): cluster_info = modal.experimental.get_cluster_info() container_rank = cluster_info.rank world_size = len(cluster_info.container_ips) main_addr = cluster_info.container_ips[0] is_main = "(main)" if container_rank == 0 else "" print(f"{container_rank=} {is_main} {world_size=} {main_addr=}") ... ``` Applying this decorator under `@app.function` modifies the Function so that remote calls to it are serviced by a multi-node container group. The above configuration creates a group of four containers, each having 8 H100 GPU devices, for a total of 32 devices. ## Scheduling A `modal.experimental.clustered` Function runs on multiple nodes in our cloud, but executes like a normal function call. For example, all nodes are scheduled together ([gang scheduling](https://en.wikipedia.org/wiki/Gang_scheduling)) so that your code runs on all of the requested hardware or not at all. Traditionally this kind of cluster and scheduling management would be handled by SLURM, Kubernetes, or manually.
But with Modal it’s all provided serverlessly with just an application of the decorator! ### Rank & input broadcast ![diagram](https://modal-cdn.com/cdnbot/multinodepmgnla70_4b57a155.webp) You may notice above that a single `.remote` Function call created three input executions but returned only one output. This is how input-output is structured for multi-node training jobs on Modal. The Function call’s arguments are replicated to each container, but only the rank zero container’s output is returned to the caller. A container’s rank is a key concept in multi-node training jobs. Rank zero is the ‘leader’ rank and typically coordinates the job. Rank zero is also known as the “main” container. Rank zero’s output will always be the output of a multi-node training run. ## Networking Function containers cannot normally make direct network connections to other Function containers, but this is a requirement for multi-node training communication. So, along with gang scheduling, the `@clustered` decorator enables Modal’s workspace-private inter-container networking called [i6pn](https://www.notion.so/Multi-node-docs-1281e7f16949806f966adedfe8b2cb74?pvs=21). The [cluster networking guide](https://modal.com/docs/guide/private-networking) goes into more detail on i6pn, but the upshot is that each container in the cluster is made aware of the network address of all the other containers in the cluster, enabling them to communicate with each other quickly via [TCP](https://pytorch.org/docs/stable/elastic/rendezvous.html). ### RDMA (Infiniband) H100 clusters are equipped with Infiniband, providing up to 3,200 Gbps scale-out bandwidth for inter-node communication. RDMA scale-out networking is enabled with the `rdma` parameter of `modal.experimental.clustered`. ```python notest @modal.experimental.clustered(size=2, rdma=True) def train(): ... ``` To run a simple Infiniband RDMA performance test, see the [`modal-examples` repository example](https://github.com/modal-labs/multinode-training-guide/tree/main/benchmark). ## Cluster Info `modal.experimental.get_cluster_info()` exposes the following information about the cluster: - `rank: int` is the container's order within the cluster, starting from `0`, the leader. - `container_ips: list[str]` contains the IPv6 addresses of each container in the cluster, sorted by rank. ## Fault Tolerance For a clustered Function, failures in inputs and containers are handled differently. If an input fails on any container, this failure **is not propagated** to other containers in the cluster. Containers are responsible for detecting and responding to input failures on other containers. Only rank 0’s output matters: if an input fails on the leader container (rank 0), the input is marked as failed, even if the input succeeds on another container. Similarly, if an input succeeds on the leader container but fails on another container, the input will still be marked as successful. If a container in the cluster is preempted, Modal will terminate all remaining containers in the cluster, and retry the input. ### Input Synchronization _**Important:**_ Synchronization is not relevant for single training runs, and applies mostly to inference use-cases. Modal does not synchronize input execution across containers. Containers are responsible for ensuring that they do not process inputs faster than other containers in their cluster.
In particular, it is important that the leader container (rank 0) only starts processing the next input after all other containers have finished processing the current input. ## Examples To get hands-on with multi-node training you can jump into the [`multinode-training-guide` repository](https://github.com/modal-labs/multinode-training-guide) or [`modal-examples` repository](https://github.com/modal-labs/modal-examples/tree/main/12_datasets) and `modal run` something! - [Simple ‘hello world’ 4 X 1 H100 torch cluster example](https://github.com/modal-labs/modal-examples/blob/main/14_clusters/simple_torch_cluster.py) - [Infiniband RDMA performance test](https://github.com/modal-labs/multinode-training-guide/tree/main/benchmark) - [Use 2 x 8 H100s to train a ResNet50 model on the ImageNet dataset](https://github.com/modal-labs/multinode-training-guide/tree/main/resnet50) - [Speedrun GPT-2 training with modded-nanogpt](https://github.com/modal-labs/multinode-training-guide/tree/main/nanoGPT) ### Torchrun Example ```python import modal import modal.experimental image = ( modal.Image.debian_slim(python_version="3.12") .pip_install("torch~=2.5.1", "numpy~=2.2.1") .add_local_dir( "training", remote_path="/root/training" ) ) app = modal.App("example-simple-torch-cluster", image=image) n_nodes = 4 @app.function( gpu=f"H100:8", timeout=60 * 60 * 24, ) @modal.experimental.clustered(size=n_nodes) def launch_torchrun(): # import the 'torchrun' interface directly. from torch.distributed.run import parse_args, run cluster_info = modal.experimental.get_cluster_info() run( parse_args( [ f"--nnodes={n_nodes}", f"--node_rank={cluster_info.rank}", f"--master_addr={cluster_info.container_ips[0]}", f"--nproc-per-node=8", "--master_port=1234", "training/train.py", ] ) ) ``` ### Scheduling and cron jobs # Scheduling remote cron jobs A common requirement is to perform some task at a given time every day or week automatically. Modal facilitates this through function schedules. ## Basic scheduling Let's say we have a Python module `heavy.py` with a function, `perform_heavy_computation()`. ```python # heavy.py def perform_heavy_computation(): ... if __name__ == "__main__": perform_heavy_computation() ``` To schedule this function to run once per day, we create a Modal App and attach our function to it with the `@app.function` decorator and a schedule parameter: ```python # heavy.py import modal app = modal.App() @app.function(schedule=modal.Period(days=1)) def perform_heavy_computation(): ... ``` To activate the schedule, deploy your app, either through the CLI: ```shell modal deploy --name daily_heavy heavy.py ``` Or programmatically: ```python if __name__ == "__main__": app.deploy() ``` Now the function will run every day, at the time of the initial deployment, without any further interaction on your part. When you make changes to your function, just rerun the deploy command to overwrite the old deployment. Note that when you redeploy your function, `modal.Period` resets, and the schedule will run X hours after this most recent deployment. If you want to run your function at a regular schedule not disturbed by deploys, `modal.Cron` (see below) is a better option. ## Monitoring your scheduled runs To see past execution logs for the scheduled function, go to the [Apps](https://modal.com/apps) section on the Modal web site. Schedules currently cannot be paused. Instead the schedule should be removed and the app redeployed. Schedules can be started manually on the app's dashboard page, using the "run now" button. 
## Schedule types There are two kinds of base schedule values - [`modal.Period`](https://modal.com/docs/reference/modal.Period) and [`modal.Cron`](https://modal.com/docs/reference/modal.Cron). [`modal.Period`](https://modal.com/docs/reference/modal.Period) lets you specify an interval between function calls, e.g. `Period(days=1)` or `Period(hours=5)`: ```python # runs once every 5 hours @app.function(schedule=modal.Period(hours=5)) def perform_heavy_computation(): ... ``` [`modal.Cron`](https://modal.com/docs/reference/modal.Cron) gives you finer control using [cron](https://en.wikipedia.org/wiki/Cron) syntax: ```python # runs at 8 am (UTC) every Monday @app.function(schedule=modal.Cron("0 8 * * 1")) def perform_heavy_computation(): ... # runs daily at 6 am (New York time) @app.function(schedule=modal.Cron("0 6 * * *", timezone="America/New_York")) def send_morning_report(): ... ``` For more details, see the API reference for [Period](https://modal.com/docs/reference/modal.Period), [Cron](https://modal.com/docs/reference/modal.Cron) and [Function](https://modal.com/docs/reference/modal.Function) ### Deployment #### Apps, Functions, and entrypoints # Apps, Functions, and entrypoints An `App` is the object that represents an application running on Modal. All functions and classes are associated with an [`App`](https://modal.com/docs/reference/modal.App#modalapp). When you [`run`](https://modal.com/docs/reference/cli/run) or [`deploy`](https://modal.com/docs/reference/cli/deploy) an `App`, it creates an ephemeral or a deployed `App`, respectively. You can view a list of all currently running Apps on the [`apps`](https://modal.com/apps) page. ## Ephemeral Apps An ephemeral App is created when you use the [`modal run`](https://modal.com/docs/reference/cli/run) CLI command, or the [`app.run`](https://modal.com/docs/reference/modal.App#run) method. This creates a temporary App that only exists for the duration of your script. Ephemeral Apps are stopped automatically when the calling program exits, or when the server detects that the client is no longer connected. You can use [`--detach`](https://modal.com/docs/reference/cli/run) in order to keep an ephemeral App running even after the client exits. By using `app.run` you can run your Modal apps from within your Python scripts: ```python def main(): ... with app.run(): some_modal_function.remote() ``` By default, running your app in this way won't propagate Modal logs and progress bar messages. To enable output, use the [`modal.enable_output`](https://modal.com/docs/reference/modal.enable_output) context manager: ```python def main(): ... with modal.enable_output(): with app.run(): some_modal_function.remote() ``` ## Deployed Apps A deployed App is created using the [`modal deploy`](https://modal.com/docs/reference/cli/deploy) CLI command. The App is persisted indefinitely until you delete it via the [web UI](https://modal.com/apps). Functions in a deployed App that have an attached [schedule](https://modal.com/docs/guide/cron) will be run on a schedule. Otherwise, you can invoke them manually using [web endpoints or Python](https://modal.com/docs/guide/trigger-deployed-functions). Deployed Apps are named via the [`App`](https://modal.com/docs/reference/modal.App#modalapp) constructor. Re-deploying an existing `App` (based on the name) will update it in place. ## Entrypoints for ephemeral Apps The code that runs first when you `modal run` an App is called the "entrypoint". 
You can register a local entrypoint using the [`@app.local_entrypoint()`](https://modal.com/docs/reference/modal.App#local_entrypoint) decorator. You can also use a regular Modal function as an entrypoint, in which case only the code in global scope is executed locally. ### Argument parsing If your entrypoint function takes arguments with primitive types, `modal run` automatically parses them as CLI options. For example, the following function can be called with `modal run script.py --foo 1 --bar "hello"`: ```python # script.py @app.local_entrypoint() def main(foo: int, bar: str): some_modal_function.remote(foo, bar) ``` If you wish to use your own argument parsing library, such as `argparse`, you can instead accept a variable-length argument list for your entrypoint or your function. In this case, Modal skips CLI parsing and forwards CLI arguments as a tuple of strings. For example, the following function can be invoked with `modal run my_file.py --foo=42 --bar="baz"`: ```python import argparse @app.function() def train(*arglist): parser = argparse.ArgumentParser() parser.add_argument("--foo", type=int) parser.add_argument("--bar", type=str) args = parser.parse_args(args = arglist) ``` ### Manually specifying an entrypoint If there is only one `local_entrypoint` registered, [`modal run script.py`](https://modal.com/docs/reference/cli/run) will automatically use it. If you have no entrypoint specified, and just one decorated Modal function, that will be used as a remote entrypoint instead. Otherwise, you can direct `modal run` to use a specific entrypoint. For example, if you have a function decorated with [`@app.function()`](https://modal.com/docs/reference/modal.App#function) in your file: ```python # script.py @app.function() def f(): print("Hello world!") @app.function() def g(): print("Goodbye world!") @app.local_entrypoint() def main(): f.remote() ``` Running [`modal run script.py`](https://modal.com/docs/reference/cli/run) will execute the `main` function locally, which would call the `f` function remotely. However you can instead run `modal run script.py::app.f` or `modal run script.py::app.g` to execute `f` or `g` directly. ## Apps were once Stubs The `modal.App` class in the client was previously called `modal.Stub`. The old name was kept as an alias for some time, but from Modal 1.0.0 onwards, using `modal.Stub` will result in an error. #### Managing deployments # Managing deployments Once you've finished using `modal run` or `modal serve` to iterate on your Modal code, it's time to deploy. A Modal deployment creates and then persists an application and its objects, providing the following benefits: - Repeated application function executions will be grouped under the deployment, aiding observability and usage tracking. Programmatically triggering lots of ephemeral App runs can clutter your web and CLI interfaces. - Function calls are much faster because deployed functions are persistent and reused, not created on-demand by calls. Learn how to trigger deployed functions in [Invoking deployed functions](https://modal.com/docs/guide/trigger-deployed-functions). - [Scheduled functions](https://modal.com/docs/guide/cron) will continue scheduling separate from any local iteration you do, and will notify you on failure. - [Web endpoints](https://modal.com/docs/guide/webhooks) keep running when you close your laptop, and their URL address matches the deployment name. 
## Creating deployments Deployments are created using the [`modal deploy` command](https://modal.com/docs/reference/cli/deploy). ``` % modal deploy -m whisper_pod_transcriber.main ✓ Initialized. View app page at https://modal.com/apps/ap-PYc2Tb7JrkskFUI8U5w0KG. ✓ Created objects. ├── 🔨 Created populate_podcast_metadata. ├── 🔨 Mounted /home/ubuntu/whisper_pod_transcriber at /root/whisper_pod_transcriber ├── 🔨 Created fastapi_app => https://modal-labs-whisper-pod-transcriber-fastapi-app.modal.run ├── 🔨 Mounted /home/ubuntu/whisper_pod_transcriber/whisper_frontend/dist at /assets ├── 🔨 Created search_podcast. ├── 🔨 Created refresh_index. ├── 🔨 Created transcribe_segment. ├── 🔨 Created transcribe_episode. └── 🔨 Created fetch_episodes. ✓ App deployed! 🎉 View Deployment: https://modal.com/apps/modal-labs/whisper-pod-transcriber ``` Running this command on an existing deployment will redeploy the App, incrementing its version. For details on how live deployed apps transition between versions, see the [Updating deployments](#updating-deployments) section. Deployments can also be created programmatically using Modal's [Python API](https://modal.com/docs/reference/modal.App#deploy). ## Viewing deployments Deployments can be viewed either on the [apps](https://modal.com/apps) web page or by using the [`modal app list` command](https://modal.com/docs/reference/cli/app#modal-app-list). ## Updating deployments A deployment can deploy a new App or redeploy a new version of an existing deployed App. It's useful to understand how Modal handles the transition between versions when an App is redeployed. In general, Modal aims to support zero-downtime deployments by gradually transitioning traffic to the new version. If the deployment involves building new versions of the Images used by the App, the build process will need to complete successfully. The existing version of the App will continue to handle requests during this time. Errors during the build will abort the deployment with no change to the status of the App. After the build completes, Modal will start to bring up new containers running the latest version of the App. The existing containers will continue handling requests (using the previous version of the App) until the new containers have completed their cold start. Once the new containers are ready, old containers will stop accepting new requests. However, the old containers will continue running any requests they had previously accepted. The old containers will not terminate until they have finished processing all ongoing requests. Any warm pool containers will also be cycled during a deployment, as the previous version's warm pool is now outdated. ## Deployment rollbacks To quickly reset an App back to a previous version, you can perform a deployment _rollback_. Rollbacks can be triggered from either the App dashboard or the CLI. Rollback deployments look like new deployments: they increment the version number and are attributed to the user who triggered the rollback. But the App's functions and metadata will be reset to their previous state independently of your current App codebase. Note that deployment rollbacks are supported only on the Team and Enterprise plans. ## Stopping deployments Deployed apps can be stopped in the web UI by clicking the red "Stop app" button on the App's "Overview" page, or alternatively from the command line using the [`modal app stop` command](https://modal.com/docs/reference/cli/app#modal-app-stop). Stopping an App is a destructive action.
Apps cannot be restarted from this state; a new App will need to be deployed from the same source files. Objects associated with stopped deployments will eventually be garbage collected. #### Invoking deployed functions # Invoking deployed functions Modal lets you take a function created by a [deployment](https://modal.com/docs/guide/managing-deployments) and call it from other contexts. There are two ways of invoking deployed functions. If the invoking client is running Python, then the same [Modal client library](https://pypi.org/project/modal/) used to write Modal code can be used. HTTPS is used if the invoking client is not running Python and therefore cannot import the Modal client library. ## Invoking with Python Some use cases for Python invocation include: - An existing Python web server (e.g. Django, Flask) wants to invoke Modal functions. - You have split your product or system into multiple Modal applications that deploy independently and call each other. ### Function lookup and invocation basics Let's say you have a script `shared_app.py` and this script defines a Modal app with a function that computes the square of a number: ```python import modal app = modal.App("my-shared-app") @app.function() def square(x: int): return x ** 2 ``` You can deploy this app to create a persistent deployment: ``` % modal deploy shared_app.py ✓ Initialized. ✓ Created objects. ├── 🔨 Created square. ├── 🔨 Mounted /Users/erikbern/modal/shared_app.py. ✓ App deployed! 🎉 View Deployment: https://modal.com/apps/erikbern/my-shared-app ``` Let's try to run this function from a different context. For instance, let's fire up the Python interactive interpreter: ```bash % python Python 3.9.5 (default, May 4 2021, 03:29:30) [Clang 12.0.0 (clang-1200.0.32.27)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import modal >>> f = modal.Function.from_name("my-shared-app", "square") >>> f.remote(42) 1764 >>> ``` This works exactly the same as a regular Modal `Function` object. For example, you can `.map()` over functions invoked this way too: ```bash >>> f = modal.Function.from_name("my-shared-app", "square") >>> f.map([1, 2, 3, 4, 5]) [1, 4, 9, 16, 25] ``` #### Authentication The Modal Python SDK will read the token from `~/.modal.toml`, which is typically created using `modal token new`. Another method of providing the credentials is to set the environment variables `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET`. If you want to call a Modal function from a context such as a web server, you can expose these environment variables to the process. #### Lookup of lifecycle functions [Lifecycle functions](https://modal.com/docs/guide/lifecycle-functions) are defined on classes, which you can look up in a different way. Consider this code: ```python import modal app = modal.App("my-shared-app") @app.cls() class MyLifecycleClass: @modal.enter() def enter(self): self.var = "hello world" @modal.method() def foo(self): return self.var ``` Let's say you deploy this app. You can then call the function by doing this: ```bash >>> cls = modal.Cls.from_name("my-shared-app", "MyLifecycleClass") >>> obj = cls() # You can pass any constructor arguments here >>> obj.foo.remote() 'hello world' ``` ### Asynchronous invocation In certain contexts, a Modal client will need to trigger Modal functions without waiting on the result. This is done by spawning functions and receiving a [`FunctionCall`](https://modal.com/docs/reference/modal.FunctionCall) as a handle to the triggered execution.
The following is an example of a Flask web server (running outside Modal) that accepts model training jobs to be executed within Modal. Instead of the HTTP POST request waiting on a training job to complete, which would be infeasible, the relevant Modal function is spawned and the [`FunctionCall`](https://modal.com/docs/reference/modal.FunctionCall) object is stored for later polling of execution status. ```python from uuid import uuid4 import modal from flask import Flask, jsonify, request app = Flask(__name__) pending_jobs = {} ... @app.route("/jobs", methods=["POST"]) def create_job(): predict_fn = modal.Function.from_name("example", "train_model") job_id = str(uuid4()) function_call = predict_fn.spawn( job_id=job_id, params=request.json, ) pending_jobs[job_id] = function_call return { "job_id": job_id, "status": "pending", } ``` ### Importing a Modal function between Modal apps You can also import one function defined in an app from another app: ```python import modal app = modal.App("another-app") square = modal.Function.from_name("my-shared-app", "square") @app.function() def cube(x): return x * square.remote(x) @app.local_entrypoint() def main(): assert cube.remote(42) == 74088 ``` ### Comparison with HTTPS Compared with HTTPS invocation, Python invocation has the following benefits: - Avoids the need to create web endpoint functions. - Avoids handling serialization of request and response data between Modal and your client. - Uses the Modal client library's built-in authentication. - Web endpoints are public to the entire internet, whereas function `lookup` only exposes your code to you (and your org). - You can work with shared Modal functions as if they are normal Python functions, which might be more convenient. ## Invoking with HTTPS Any non-Python application client can interact with deployed Modal applications via [web endpoint functions](https://modal.com/docs/guide/webhooks). Anything able to make HTTPS requests can trigger a Modal web endpoint function. Note that all deployed web endpoint functions have [a stable HTTPS URL](https://modal.com/docs/guide/webhook-urls). Some use cases for HTTPS invocation include: - Calling Modal functions from a web browser client running Javascript - Calling Modal functions from non-Python backend services (Java, Go, Ruby, NodeJS, etc.) - Calling Modal functions using UNIX tools (`curl`, `wget`) However, if the client of your Modal deployment is running Python, it's better to use the [Modal client library](https://pypi.org/project/modal/) to invoke your Modal code. For more detail on setting up functions for invocation over HTTP, see the [web endpoints guide](https://modal.com/docs/guide/webhooks). #### Continuous deployment # Continuous deployment It's a common pattern to auto-deploy your Modal App as part of a CI/CD pipeline. To get you started, below is a guide to doing continuous deployment of a Modal App in GitHub. ## GitHub Actions Here's a sample GitHub Actions workflow that deploys your App on every push to the `main` branch. This requires you to create a [Modal token](https://modal.com/settings/tokens) and add it as a [secret for your Github Actions workflow](https://github.com/Azure/actions-workflow-samples/blob/master/assets/create-secrets-for-GitHub-workflows.md).
After setting up secrets, create a new workflow file in your repository at `.github/workflows/ci-cd.yml` with the following contents: ```yaml name: CI/CD on: push: branches: - main jobs: deploy: name: Deploy runs-on: ubuntu-latest env: MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }} MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }} steps: - name: Checkout Repository uses: actions/checkout@v4 - name: Install Python uses: actions/setup-python@v5 with: python-version: "3.10" - name: Install Modal run: | python -m pip install --upgrade pip pip install modal - name: Deploy job run: | modal deploy -m my_package.my_file ``` Be sure to replace `my_package.my_file` with your actual entrypoint. If you use multiple Modal [Environments](https://modal.com/docs/guide/environments), you can additionally specify the target environment in the YAML using `MODAL_ENVIRONMENT=xyz`. #### Running untrusted code in Functions # Running untrusted code in Functions Modal provides two primitives for running untrusted code: Restricted Functions and [Sandboxes](https://modal.com/docs/guide/sandbox). While both can be used for running untrusted code, they serve different purposes: Sandboxes provide a container-like interface while Restricted Functions provide an interface similar to a traditional Function. Restricted Functions are useful for executing: - Code generated by language models (LLMs) - User-submitted code in interactive environments - Third-party plugins or extensions ## Using `restrict_modal_access` To restrict a Function's access to Modal resources, set `restrict_modal_access=True` on the Function definition: ```python import modal app = modal.App() @app.function(restrict_modal_access=True) def run_untrusted_code(code_input: str): # This function cannot access Modal resources return eval(code_input) ``` When `restrict_modal_access` is enabled: - The Function cannot access Modal resources (Queues, Dicts, etc.) - The Function cannot call other Functions - The Function cannot access Modal's internal APIs ## Comparison with Sandboxes While both `restrict_modal_access` and [Sandboxes](https://modal.com/docs/guide/sandbox) can be used for running untrusted code, they serve different purposes: | Feature | Restricted Function | Sandbox | | --------- | ------------------------------ | ---------------------------------------------- | | State | Stateless | Stateful | | Interface | Function-like | Container-like | | Setup | Simple decorator | Requires explicit creation/termination | | Use case | Quick, isolated code execution | Interactive development, long-running sessions | ## Best Practices When running untrusted code, consider these additional security measures: 1. Use `max_inputs=1` to ensure each container only handles one request. Containers that get reused could cause information leakage between users. ```python @app.function(restrict_modal_access=True, max_inputs=1) def isolated_function(input_data): # Each input gets a fresh container return process(input_data) ``` 2. Set appropriate timeouts to prevent long-running operations: ```python @app.function( restrict_modal_access=True, timeout=30, # 30 second timeout max_inputs=1 ) def time_limited_function(input_data): return process(input_data) ``` 3. 
Consider using `block_network=True` to prevent the container from making outbound network requests: ```python @app.function( restrict_modal_access=True, block_network=True, max_inputs=1 ) def network_isolated_function(input_data): return process(input_data) ``` ## Example: Running LLM-generated Code Below is a complete example of running code generated by a language model: ```python import modal app = modal.App("restricted-access-example") @app.function(restrict_modal_access=True, max_inputs=1, timeout=30, block_network=True) def run_llm_code(generated_code: str): try: # Create a restricted environment execution_scope = {} # Execute the generated code exec(generated_code, execution_scope) # Return the result if it exists return execution_scope.get("result", None) except Exception as e: return f"Error executing code: {str(e)}" @app.local_entrypoint() def main(): # Example LLM-generated code code = """ def calculate_fibonacci(n): if n <= 1: return n return calculate_fibonacci(n-1) + calculate_fibonacci(n-2) result = calculate_fibonacci(10) """ result = run_llm_code.remote(code) print(f"Result: {result}") ``` This example locks down the container to ensure that the code is safe to execute by: - Restricting Modal access - Using a fresh container for each execution - Setting a timeout - Blocking network access - Catching and handling potential errors ## Error Handling When a restricted Function attempts to access Modal resources, it will raise an `AuthError`: ```python @app.function(restrict_modal_access=True) def restricted_function(q: modal.Queue): try: # This will fail because the Function is restricted return q.get() except modal.exception.AuthError as e: return f"Access denied: {e}" ``` The error message will indicate that the operation is not permitted due to restricted Modal access. ### Secrets and environment variables #### Secrets # Secrets Securely provide credentials and other sensitive information to your Modal Functions with Secrets. You can create and edit Secrets via the [dashboard](https://modal.com/secrets), the command line interface ([`modal secret`](https://modal.com/docs/reference/cli/secret)), and programmatically from Python code ([`modal.Secret`](https://modal.com/docs/reference/modal.Secret)). To inject Secrets into the container running your Function, add the `secrets=[...]` argument to your `app.function` or `app.cls` decoration. ## Deploy Secrets from the Modal Dashboard The most common way to create a Modal Secret is to use the [Secrets panel of the Modal dashboard](https://modal.com/secrets), which also shows any existing Secrets. When you create a new Secret, you'll be prompted with a number of templates to help you get started. These templates demonstrate standard formats for credentials for everything from Postgres and MongoDB to Weights & Biases and Hugging Face. ## Use Secrets in your Modal Apps You can then use your Secret by constructing it `from_name` when defining a Modal App and then accessing its contents as environment variables. For example, if you have a Secret called `secret-keys` containing the key `MY_PASSWORD`: ```python @app.function(secrets=[modal.Secret.from_name("secret-keys")]) def some_function(): import os secret_key = os.environ["MY_PASSWORD"] ... 
``` Each Secret can contain multiple keys and values, but you can also inject multiple Secrets, allowing you to separate Secrets into smaller reusable units: ```python @app.function(secrets=[ modal.Secret.from_name("my-secret-name"), modal.Secret.from_name("other-secret"), ]) def other_function(): ... ``` The Secrets are applied in order, so key-values from later `modal.Secret` objects in the list will overwrite earlier key-values in the case of a clash. For example, if both `modal.Secret` objects above contained the key `FOO`, then the value from `"other-secret"` would always be present in `os.environ["FOO"]`. ## Create Secrets programmatically In addition to defining Secrets on the web dashboard, you can programmatically create a Secret directly in your script and send it along to your Function using `Secret.from_dict(...)`. This can be useful if you want to send Secrets from your local development machine to the remote Modal App. ```python import os import modal if modal.is_local(): local_secret = modal.Secret.from_dict({"FOO": os.environ["LOCAL_FOO"]}) else: local_secret = modal.Secret.from_dict({}) @app.function(secrets=[local_secret]) def some_function(): import os print(os.environ["FOO"]) ``` If you have [`python-dotenv`](https://pypi.org/project/python-dotenv/) installed, you can also use `Secret.from_dotenv()` to create a Secret from the variables in a `.env` file: ```python @app.function(secrets=[modal.Secret.from_dotenv()]) def some_other_function(): print(os.environ["USERNAME"]) ``` ## Interact with Secrets from the command line You can create, list, and delete your Modal Secrets with the `modal secret` command line interface. View your Secrets and their timestamps with: ```bash modal secret list ``` Create a new Secret by passing `{KEY}={VALUE}` pairs to `modal secret create`: ```bash modal secret create database-secret PGHOST=uri PGPORT=5432 PGUSER=admin PGPASSWORD=hunter2 ``` or using environment variables (assuming below that the `PGPASSWORD` environment variable is set e.g. by your CI system): ```bash modal secret create database-secret PGHOST=uri PGPORT=5432 PGUSER=admin PGPASSWORD="$PGPASSWORD" ``` Remove Secrets by passing their name to `modal secret delete`: ```bash modal secret delete database-secret ``` #### Environment variables # Environment variables The Modal runtime sets several environment variables during initialization. The keys for these environment variables are reserved and cannot be overridden by your Function or Sandbox configuration. These variables provide information about the container's runtime environment. ## Container runtime environment variables The following variables are present in every Modal container: - **`MODAL_CLOUD_PROVIDER`** — Modal executes containers across a number of cloud providers ([AWS](https://aws.amazon.com/), [GCP](https://cloud.google.com/), [OCI](https://www.oracle.com/cloud/)). This variable specifies which cloud provider the Modal container is running within. - **`MODAL_IMAGE_ID`** — The ID of the [`modal.Image`](https://modal.com/docs/reference/modal.Image) used by the Modal container. - **`MODAL_REGION`** — This will correspond to a geographic area identifier from the cloud provider associated with the Modal container (see above). For AWS, the identifier is a "region". For GCP it is a "zone", and for OCI it is an "availability domain". Example values are `us-east-1` (AWS), `us-central1` (GCP), `us-ashburn-1` (OCI). - **`MODAL_TASK_ID`** — The ID of the container running the Modal Function or Sandbox.
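For example, a Function can read these variables at runtime to report where it is executing. Here is a minimal sketch (the App name `runtime-info-example` and the function are illustrative, not part of Modal's API):

```python
import os

import modal

app = modal.App("runtime-info-example")  # illustrative name


@app.function()
def report_runtime():
    # Each of these variables is set by the Modal runtime in every container
    print("cloud provider:", os.environ["MODAL_CLOUD_PROVIDER"])
    print("region:", os.environ["MODAL_REGION"])
    print("image ID:", os.environ["MODAL_IMAGE_ID"])
    print("task ID:", os.environ["MODAL_TASK_ID"])
```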
## Function runtime environment variables The following variables are present in containers running Modal Functions: - **`MODAL_ENVIRONMENT`** — The name of the [Modal Environment](https://modal.com/docs/guide/environments) the container is running within. - **`MODAL_IS_REMOTE`** - Set to '1' to indicate that Modal Function code is running in a remote container. - **`MODAL_IDENTITY_TOKEN`** — An [OIDC token](https://modal.com/docs/guide/oidc-integration) encoding the identity of the Modal Function. ## Sandbox environment variables The following variables are present within [`modal.Sandbox`](https://modal.com/docs/reference/modal.Sandbox) instances. - **`MODAL_SANDBOX_ID`** — The ID of the Sandbox. ## Container image environment variables The container image layers used by a `modal.Image` may set environment variables. These variables will be present within your container's runtime environment. For example, the [`debian_slim`](https://modal.com/docs/reference/modal.Image#debian_slim) image sets the `GPG_KEY` variable. To override image variables or set new ones, use the [`.env`](https://modal.com/docs/reference/modal.Image#env) method provided by `modal.Image`. ### Web endpoints #### Web endpoints # Web endpoints This guide explains how to set up web endpoints with Modal. All deployed Modal Functions can be [invoked from any other Python application](https://modal.com/docs/guide/trigger-deployed-functions) using the Modal client library. We additionally provide multiple ways to expose your Functions over the web for non-Python clients. You can [turn any Python function into a web endpoint](#simple-endpoints) with a single line of code, you can [serve a full app](#serving-asgi-and-wsgi-apps) using frameworks like FastAPI, Django, or Flask, or you can [serve anything that speaks HTTP and listens on a port](#non-asgi-web-servers). Below we walk through each method, assuming you're familiar with web applications outside of Modal. For a detailed walkthrough of basic web endpoints on Modal aimed at developers new to web applications, see [this tutorial](https://modal.com/docs/examples/basic_web). ## Simple endpoints The easiest way to create a web endpoint from an existing Python function is to use the [`@modal.fastapi_endpoint` decorator](https://modal.com/docs/reference/modal.fastapi_endpoint). ```python image = modal.Image.debian_slim().pip_install("fastapi[standard]") @app.function(image=image) @modal.fastapi_endpoint() def f(): return "Hello world!" ``` This decorator wraps the Modal Function in a [FastAPI application](#how-do-web-endpoints-run-in-the-cloud). _Note: Prior to v0.73.82, this function was named `@modal.web_endpoint`_. ### Developing with `modal serve` You can run this code as an ephemeral app, by running the command ```shell modal serve server_script.py ``` Where `server_script.py` is the file name of your code. This will create an ephemeral app for the duration of your script (until you hit Ctrl-C to stop it). It creates a temporary URL that you can use like any other REST endpoint. This URL is on the public internet. The `modal serve` command will live-update an app when any of its supporting files change. Live updating is particularly useful when working with apps containing web endpoints, as any changes made to web endpoint handlers will show up almost immediately, without requiring a manual restart of the app. 
### Web endpoints

#### Web endpoints

# Web endpoints

This guide explains how to set up web endpoints with Modal.

All deployed Modal Functions can be [invoked from any other Python application](https://modal.com/docs/guide/trigger-deployed-functions) using the Modal client library. We additionally provide multiple ways to expose your Functions over the web for non-Python clients.

You can [turn any Python function into a web endpoint](#simple-endpoints) with a single line of code, you can [serve a full app](#serving-asgi-and-wsgi-apps) using frameworks like FastAPI, Django, or Flask, or you can [serve anything that speaks HTTP and listens on a port](#non-asgi-web-servers).

Below we walk through each method, assuming you're familiar with web applications outside of Modal. For a detailed walkthrough of basic web endpoints on Modal aimed at developers new to web applications, see [this tutorial](https://modal.com/docs/examples/basic_web).

## Simple endpoints

The easiest way to create a web endpoint from an existing Python function is to use the [`@modal.fastapi_endpoint` decorator](https://modal.com/docs/reference/modal.fastapi_endpoint).

```python
image = modal.Image.debian_slim().pip_install("fastapi[standard]")


@app.function(image=image)
@modal.fastapi_endpoint()
def f():
    return "Hello world!"
```

This decorator wraps the Modal Function in a [FastAPI application](#how-do-web-endpoints-run-in-the-cloud).

_Note: Prior to v0.73.82, this function was named `@modal.web_endpoint`_.

### Developing with `modal serve`

You can run this code as an ephemeral app by running the command

```shell
modal serve server_script.py
```

where `server_script.py` is the file name of your code. This will create an ephemeral app for the duration of your script (until you hit Ctrl-C to stop it). It creates a temporary URL that you can use like any other REST endpoint. This URL is on the public internet.

The `modal serve` command will live-update an app when any of its supporting files change. Live updating is particularly useful when working with apps containing web endpoints, as any changes made to web endpoint handlers will show up almost immediately, without requiring a manual restart of the app.

### Deploying with `modal deploy`

You can also deploy your app and create a persistent web endpoint in the cloud by running `modal deploy` on the same file, e.g. `modal deploy server_script.py`.

### Passing arguments to an endpoint

When using `@modal.fastapi_endpoint`, you can add [query parameters](https://fastapi.tiangolo.com/tutorial/query-params/) which will be passed to your Function as arguments. For instance:

```python
image = modal.Image.debian_slim().pip_install("fastapi[standard]")


@app.function(image=image)
@modal.fastapi_endpoint()
def square(x: int):
    return {"square": x**2}
```

If you hit this with a URL-encoded query string with the `x` parameter present, the Function will receive the value as an argument:

```
$ curl https://modal-labs--web-endpoint-square-dev.modal.run?x=42
{"square":1764}
```

If you want to use a `POST` request, you can use the `method` argument to `@modal.fastapi_endpoint` to set the HTTP verb. To accept any valid JSON object, [use `dict` as your type annotation](https://fastapi.tiangolo.com/tutorial/body-nested-models/?h=dict#bodies-of-arbitrary-dicts) and FastAPI will handle the rest.

```python
image = modal.Image.debian_slim().pip_install("fastapi[standard]")


@app.function(image=image)
@modal.fastapi_endpoint(method="POST")
def square(item: dict):
    return {"square": item['x']**2}
```

This now creates an endpoint that takes a JSON body:

```
$ curl -X POST -H 'Content-Type: application/json' --data-binary '{"x": 42}' https://modal-labs--web-endpoint-square-dev.modal.run
{"square":1764}
```

This is often the easiest way to get started, but note that FastAPI recommends that you use [typed Pydantic models](https://fastapi.tiangolo.com/tutorial/body/) in order to get automatic validation and documentation. FastAPI also lets you pass data to web endpoints in other ways, for instance as [form data](https://fastapi.tiangolo.com/tutorial/request-forms/) and [file uploads](https://fastapi.tiangolo.com/tutorial/request-files/).

## How do web endpoints run in the cloud?

Note that web endpoints, like everything else on Modal, only run when they need to. When you hit the web endpoint the first time, it will boot up the container, which might take a few seconds. Modal keeps the container alive for a short period in case there are subsequent requests. If there are a lot of requests, Modal might create more containers running in parallel.

For the shortcut `@modal.fastapi_endpoint` decorator, Modal wraps your function in a [FastAPI](https://fastapi.tiangolo.com/) application. This means that the [Image](https://modal.com/docs/guide/images) your Function uses must have FastAPI installed, and the Functions that you write need to follow its request and response [semantics](https://fastapi.tiangolo.com/tutorial). Web endpoint Functions can use all of FastAPI's powerful features, such as Pydantic models for automatic validation, typed query and path parameters, and response types.

Here's everything together, combining Modal's abilities to run functions in user-defined containers with the expressivity of FastAPI:

```python
import modal
from fastapi.responses import HTMLResponse
from pydantic import BaseModel

image = modal.Image.debian_slim().pip_install("fastapi[standard]", "boto3")
app = modal.App(image=image)


class Item(BaseModel):
    name: str
    qty: int = 42


@app.function()
@modal.fastapi_endpoint(method="POST")
def f(item: Item):
    import boto3

    # do things with boto3...
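    # (Illustrative sketch, not part of the original example: "do things with
    # boto3" might mean persisting the item to a hypothetical S3 bucket, with
    # AWS credentials supplied via a modal.Secret attached to this Function.)
    # s3 = boto3.client("s3")
    # s3.put_object(Bucket="example-bucket", Key=item.name, Body=str(item.qty))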
    return HTMLResponse(f"Hello, {item.name}!")
```

This endpoint definition would be called like so:

```bash
curl -d '{"name": "Erik", "qty": 10}' \
  -H "Content-Type: application/json" \
  -X POST https://ecorp--web-demo-f-dev.modal.run
```

Or in Python with the [`requests`](https://pypi.org/project/requests/) library:

```python
import requests

data = {"name": "Erik", "qty": 10}
requests.post("https://ecorp--web-demo-f-dev.modal.run", json=data, timeout=10.0)
```

## Serving ASGI and WSGI apps

You can also serve any app written in an [ASGI](https://asgi.readthedocs.io/en/latest/) or [WSGI](https://en.wikipedia.org/wiki/Web_Server_Gateway_Interface)-compatible web framework on Modal.

ASGI provides support for async web frameworks. WSGI provides support for synchronous web frameworks.

### ASGI apps - FastAPI, FastHTML, Starlette

For ASGI apps, you can create a function decorated with [`@modal.asgi_app`](https://modal.com/docs/reference/modal.asgi_app) that returns a reference to your web app:

```python
image = modal.Image.debian_slim().pip_install("fastapi[standard]")


@app.function(image=image)
@modal.concurrent(max_inputs=100)
@modal.asgi_app()
def fastapi_app():
    from fastapi import FastAPI, Request

    web_app = FastAPI()

    @web_app.post("/echo")
    async def echo(request: Request):
        body = await request.json()
        return body

    return web_app
```

Now, as before, when you deploy this script as a Modal App, you get a URL for your app that you can hit.

The `@modal.concurrent` decorator enables a single container to process multiple inputs at once, taking advantage of the asynchronous event loops in ASGI applications. See [this guide](https://modal.com/docs/guide/concurrent-inputs) for details.

#### ASGI Lifespan

While we recommend using [`@modal.enter`](https://modal.com/docs/guide/lifecycle-functions#enter) for defining container lifecycle hooks, we also support the [ASGI lifespan protocol](https://asgi.readthedocs.io/en/latest/specs/lifespan.html). Lifespans begin when containers start, typically at the time of the first request.

Here's an example using [FastAPI](https://fastapi.tiangolo.com/advanced/events/#lifespan):

```python
import modal

app = modal.App("fastapi-lifespan-app")

image = modal.Image.debian_slim().pip_install("fastapi[standard]")


@app.function(image=image)
@modal.asgi_app()
def fastapi_app_with_lifespan():
    from fastapi import FastAPI, Request

    def lifespan(wapp: FastAPI):
        print("Starting")
        yield
        print("Shutting down")

    web_app = FastAPI(lifespan=lifespan)

    @web_app.get("/")
    async def hello(request: Request):
        return "hello"

    return web_app
```
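FastAPI's own documentation writes lifespans as async context managers; if you prefer that form, the inner `lifespan` definition in the example above could equivalently be sketched like this (same behavior, just the `contextlib.asynccontextmanager` style):

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(wapp: FastAPI):
    # Runs once when the ASGI app starts up...
    print("Starting")
    yield
    # ...and once when it shuts down.
    print("Shutting down")


web_app = FastAPI(lifespan=lifespan)
```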
### WSGI apps - Django, Flask

You can serve WSGI apps using the [`@modal.wsgi_app`](https://modal.com/docs/reference/modal.wsgi_app) decorator:

```python
image = modal.Image.debian_slim().pip_install("flask")


@app.function(image=image)
@modal.concurrent(max_inputs=100)
@modal.wsgi_app()
def flask_app():
    from flask import Flask, request

    web_app = Flask(__name__)

    @web_app.post("/echo")
    def echo():
        return request.json

    return web_app
```

See [Flask's docs](https://flask.palletsprojects.com/en/2.1.x/deploying/asgi/) for more information on using Flask as a WSGI app.

Because WSGI apps are synchronous, concurrent inputs will be run on separate threads. See [this guide](https://modal.com/docs/guide/concurrent-inputs) for details.

## Non-ASGI web servers

Not all web frameworks offer an ASGI or WSGI interface. For example, [`aiohttp`](https://docs.aiohttp.org/) and [`tornado`](https://www.tornadoweb.org/) use their own asynchronous network binding, while others like [`text-generation-inference`](https://github.com/huggingface/text-generation-inference) actually expose a Rust-based HTTP server running as a subprocess.

For these cases, you can use the [`@modal.web_server`](https://modal.com/docs/reference/modal.web_server) decorator to "expose" a port on the container:

```python
@app.function()
@modal.concurrent(max_inputs=100)
@modal.web_server(8000)
def my_file_server():
    import subprocess

    subprocess.Popen("python -m http.server -d / 8000", shell=True)
```

Just like all web endpoints on Modal, this is only run on-demand. The function is executed on container startup, creating a file server at the root directory. When you hit the web endpoint URL, your request will be routed to the file server listening on port `8000`.

For `@web_server` endpoints, you need to make sure that the application binds to the external network interface, not just localhost. This usually means binding to `0.0.0.0` instead of `127.0.0.1`.

See our examples of how to serve [Streamlit](https://modal.com/docs/examples/serve_streamlit) and [ComfyUI](https://modal.com/docs/examples/comfyapp) on Modal.

## Serve many configurations with parametrized functions

Python functions that launch ASGI/WSGI apps or web servers on Modal cannot take arguments. One simple pattern for allowing client-side configuration of these web endpoints is to use [parametrized functions](https://modal.com/docs/guide/parametrized-functions). Each different choice for the values of the parameters will create a distinct auto-scaling container pool.

```python
@app.cls()
@modal.concurrent(max_inputs=100)
class Server:
    root: str = modal.parameter(default=".")

    @modal.web_server(8000)
    def files(self):
        import subprocess

        subprocess.Popen(f"python -m http.server -d {self.root} 8000", shell=True)
```

The values are provided in URLs as query parameters:

```bash
curl https://ecorp--server-files.modal.run              # use the default value
curl https://ecorp--server-files.modal.run?root=.cache  # use a different value
curl https://ecorp--server-files.modal.run?root=%2F     # don't forget to URL encode!
```

For details, see [this guide to parametrized functions](https://modal.com/docs/guide/parametrized-functions).

## WebSockets

Functions annotated with `@web_server`, `@asgi_app`, or `@wsgi_app` also support the WebSocket protocol. Consult your web framework for appropriate documentation on how to use WebSockets with that library.

WebSockets on Modal maintain a single function call per connection, which can be useful for keeping state around. Most of the time, you will want to set your handler function to [allow concurrent inputs](https://modal.com/docs/guide/concurrent-inputs), which allows multiple simultaneous WebSocket connections to be handled by the same container.

We support the full WebSocket protocol as per [RFC 6455](https://www.rfc-editor.org/rfc/rfc6455), but we do not yet have support for [RFC 8441](https://www.rfc-editor.org/rfc/rfc8441) (WebSockets over HTTP/2) or [RFC 7692](https://datatracker.ietf.org/doc/html/rfc7692) (`permessage-deflate` extension). WebSocket messages can be up to 2 MiB each.
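As a concrete illustration of the paragraphs above, here is a minimal sketch of a WebSocket echo endpoint using FastAPI behind `@modal.asgi_app`. The App name and route are made up, and any ASGI framework with WebSocket support would work similarly:

```python
import modal

image = modal.Image.debian_slim().pip_install("fastapi[standard]")
app = modal.App("websocket-echo-example")  # hypothetical App name


@app.function(image=image)
@modal.concurrent(max_inputs=100)  # let one container hold many open connections
@modal.asgi_app()
def ws_app():
    from fastapi import FastAPI, WebSocket, WebSocketDisconnect

    web_app = FastAPI()

    @web_app.websocket("/ws")
    async def echo(ws: WebSocket):
        await ws.accept()
        try:
            # One Modal function call per connection; loop until the client hangs up.
            while True:
                message = await ws.receive_text()
                await ws.send_text(f"echo: {message}")
        except WebSocketDisconnect:
            pass

    return web_app
```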
## Performance and scaling

If you have no active containers when the web endpoint receives a request, it will experience a "cold start". Consult the guide page on [cold start performance](https://modal.com/docs/guide/cold-start) for more information on when Functions will cold start and advice on how to mitigate the impact.

If your Function uses `@modal.concurrent`, multiple requests to the same endpoint may be handled by the same container, up to the configured concurrency limit. Beyond that limit, additional containers will start up to scale your App horizontally. When you reach the Function's limit on containers, requests will queue for handling.

Each workspace on Modal has a rate limit on total operations. For a new account, this is set to 200 function inputs or web endpoint requests per second, with a burst multiplier of 5 seconds. If you reach the rate limit, excess requests to web endpoints will return a [429 status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429), and you'll need to [get in touch](mailto:support@modal.com) with us about raising the limit.

Web endpoint request bodies can be up to 4 GiB, and their response bodies are unlimited in size.

## Authentication

Modal offers first-class web endpoint protection via [proxy auth tokens](https://modal.com/docs/guide/webhook-proxy-auth). Proxy auth tokens protect web endpoints by requiring a key and token combination to be passed in the `Modal-Key` and `Modal-Secret` headers. Modal works as a proxy, rejecting requests that aren't authorized to access your endpoint.
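To call a protected endpoint from outside Modal, the client sends the token pair in those two headers. A minimal sketch with `requests` follows; the endpoint URL and token values are placeholders, and the proxy auth guide linked above covers how tokens are created:

```python
import requests

headers = {
    "Modal-Key": "<token-id>",         # placeholder proxy auth token ID
    "Modal-Secret": "<token-secret>",  # placeholder proxy auth token secret
}

# Hypothetical protected endpoint URL.
response = requests.get("https://ecorp--protected-endpoint.modal.run", headers=headers, timeout=10.0)
print(response.status_code)
```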
We also support standard techniques for securing web servers.

### Token-based authentication

This is easy to implement in whichever framework you're using. For example, if you're using `@modal.fastapi_endpoint` or `@modal.asgi_app` with FastAPI, you can validate a Bearer token like this:

```python
from fastapi import Depends, HTTPException, status, Request
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import modal

image = modal.Image.debian_slim().pip_install("fastapi[standard]")
app = modal.App("auth-example", image=image)

auth_scheme = HTTPBearer()


@app.function(secrets=[modal.Secret.from_name("my-web-auth-token")])
@modal.fastapi_endpoint()
async def f(request: Request, token: HTTPAuthorizationCredentials = Depends(auth_scheme)):
    import os

    print(os.environ["AUTH_TOKEN"])
    if token.credentials != os.environ["AUTH_TOKEN"]:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Incorrect bearer token",
            headers={"WWW-Authenticate": "Bearer"},
        )

    # Function body
    return "success!"
```

This assumes you have a [Modal Secret](https://modal.com/secrets) named `my-web-auth-token` created, with contents `{AUTH_TOKEN: secret-random-token}`. Now, your endpoint will return a 401 status code except when you hit it with the correct `Authorization` header set (note that you have to prefix the token with `Bearer `):

```bash
curl --header "Authorization: Bearer secret-random-token" https://modal-labs--auth-example-f.modal.run
```

### Client IP address

You can access the IP address of the client making the request. This can be used for geolocation, whitelists, blacklists, and rate limits.

```python
from fastapi import Request

import modal

image = modal.Image.debian_slim().pip_install("fastapi[standard]")
app = modal.App(image=image)


@app.function()
@modal.fastapi_endpoint()
def get_ip_address(request: Request):
    return f"Your IP address is {request.client.host}"
```

#### Streaming endpoints

# Streaming endpoints

Modal web endpoints support streaming responses using FastAPI's [`StreamingResponse`](https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse) class. This class accepts asynchronous generators, synchronous generators, or any Python object that implements the [_iterator protocol_](https://docs.python.org/3/library/stdtypes.html#typeiter), and can be used with Modal Functions!

## Simple example

This simple example combines Modal's `@modal.fastapi_endpoint` decorator with a `StreamingResponse` object to produce a real-time server-sent events (SSE) response.

```python
import time


def fake_event_streamer():
    for i in range(10):
        yield f"data: some data {i}\n\n".encode()
        time.sleep(0.5)


@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
def stream_me():
    from fastapi.responses import StreamingResponse

    return StreamingResponse(
        fake_event_streamer(), media_type="text/event-stream"
    )
```

If you serve this web endpoint and hit it with `curl`, you will see the ten SSE events progressively appear in your terminal over a ~5 second period.

```shell
curl --no-buffer https://modal-labs--example-streaming-stream-me.modal.run
```

The MIME type of `text/event-stream` is important in this example, as it tells the downstream web server to return responses immediately, rather than buffering them in byte chunks (which is more efficient for compression). You can still return other content types like large files in streams, but they are not guaranteed to arrive as real-time events.

## Streaming responses with `.remote`

A Modal Function wrapping a generator function body can have its response passed directly into a `StreamingResponse`. This is particularly useful if you want to do some GPU processing in one Modal Function that is called by a CPU-based web endpoint Modal Function.

```python
@app.function(gpu="any")
def fake_video_render():
    for i in range(10):
        yield f"data: finished processing some data from GPU {i}\n\n".encode()
        time.sleep(1)


@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
def hook():
    from fastapi.responses import StreamingResponse

    return StreamingResponse(
        fake_video_render.remote_gen(), media_type="text/event-stream"
    )
```

## Streaming responses with `.map` and `.starmap`

You can also combine Modal Function parallelization with streaming responses, enabling applications to service a request by farming out to dozens of containers and iteratively returning result chunks to the client.

```python
@app.function()
def map_me(i):
    return f"segment {i}\n"


@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
def mapped():
    from fastapi.responses import StreamingResponse

    return StreamingResponse(map_me.map(range(10)), media_type="text/plain")
```

This snippet will spread the ten `map_me(i)` executions across containers, and return each string response part as it completes. By default the results will be ordered, but if this isn't necessary you can pass `order_outputs=False` as a keyword argument to the `.map` call.

### Asynchronous streaming

The example above uses a synchronous generator, which automatically runs on its own thread, but in asynchronous applications, a loop over a `.map` or `.starmap` call can block the event loop. This will stop the `StreamingResponse` from returning response parts iteratively to the client.

To avoid this, you can use the `.aio()` method to convert a synchronous `.map` into its async version. Also, other blocking calls should be offloaded to a separate thread with `asyncio.to_thread()`.
For example:

```python
import asyncio


@app.function(gpu="any", image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
async def transcribe_video(request):
    from fastapi.responses import StreamingResponse

    # `split_video` is assumed to be defined elsewhere in the app.
    segments = await asyncio.to_thread(split_video, request)
    return StreamingResponse(wrapper(segments), media_type="text/event-stream")


# Notice that this is an async generator.
async def wrapper(segments):
    async for partial_result in transcribe_video.map.aio(segments):
        yield "data: " + partial_result + "\n\n"
```

## Further examples

- Complete code for the simple examples given above is available [in our modal-examples GitHub repository](https://github.com/modal-labs/modal-examples/blob/main/07_web_endpoints/streaming.py).
- [An end-to-end example of streaming YouTube video transcriptions with OpenAI's Whisper model.](https://github.com/modal-labs/modal-examples/blob/main/06_gpu_and_ml/openai_whisper/streaming/main.py)

#### Web endpoint URLs

# Web endpoint URLs

This guide documents the behavior of URLs for [web endpoints](https://modal.com/docs/guide/webhooks) on Modal: automatic generation, configuration, programmatic retrieval, and more.

## Determine the URL of a web endpoint from code

Modal Functions with the [`fastapi_endpoint`](https://modal.com/docs/reference/modal.fastapi_endpoint), [`asgi_app`](https://modal.com/docs/reference/modal.asgi_app), [`wsgi_app`](https://modal.com/docs/reference/modal.wsgi_app), or [`web_server`](https://modal.com/docs/reference/modal.web_server) decorator are made available over the Internet when they are [`serve`d](https://modal.com/docs/reference/cli/serve) or [`deploy`ed](https://modal.com/docs/reference/cli/deploy), and so they have a URL. This URL is displayed in the `modal` CLI output and is available in the Modal [dashboard](https://modal.com/apps) for the Function.

To determine a Function's URL programmatically, call its [`get_web_url()`](https://modal.com/docs/reference/modal.Function#get_web_url) method:

```python
@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint(docs=True)
def show_url() -> str:
    return show_url.get_web_url()
```

For deployed Functions, this also works from other Python code! You just need to do a [`from_name`](https://modal.com/docs/reference/modal.Function#from_name) based on the name of the Function and its [App](https://modal.com/docs/guide/apps):

```python notest
import requests

remote_function = modal.Function.from_name("app", "show_url")
remote_function.get_web_url() == requests.get(remote_function.get_web_url()).json()
```

## Auto-generated URLs

By default, Modal Functions will be served from the `modal.run` domain. The full URL will be constructed from a number of pieces of information to uniquely identify the endpoint. At a high-level, web endpoint URLs for deployed applications have the following structure: `https://--