# Modal llms-full.txt > Modal is a platform for running Python code in the cloud with minimal > configuration, especially for serving AI models and high-performance batch > processing. It supports fast prototyping, serverless APIs, scheduled jobs, > GPU inference, distributed volumes, and sandboxes. Important notes: - Modal's primitives are embedded in Python and tailored for AI/GPU use cases, but they can be used for general-purpose cloud compute. - Modal is a serverless platform, meaning you are only billed for resources used and can spin up containers on demand in seconds. You can sign up for free at [https://modal.com] and get $30/month of credits. ## Guides ### Custom container images #### Defining Images # Images This guide walks you through how to define the environment your Modal Functions run in. These environments are called _containers_. Containers are like light-weight virtual machines -- container engines use [operating system tricks](https://earthly.dev/blog/chroot/) to isolate programs from each other ("containing" them), making them work as though they were running on their own hardware with their own filesystem. This makes execution environments more reproducible, for example by preventing accidental cross-contamination of environments on the same machine. For added security, Modal runs containers using the sandboxed [gVisor container runtime](https://cloud.google.com/blog/products/identity-security/open-sourcing-gvisor-a-sandboxed-container-runtime). Containers are started up from a stored "snapshot" of their filesystem state called an _image_. Producing the image for a container is called _building_ the image. By default, Modal Functions are executed in a [Debian Linux](https://en.wikipedia.org/wiki/Debian) container with a basic Python installation of the same minor version `v3.x` as your local Python interpreter. To make your Apps and Functions useful, you will probably need some third party system packages or Python libraries. Modal provides a number of options to customize your container images at different levels of abstraction and granularity, from high-level convenience methods like `pip_install` through wrappers of core container image build features like `RUN` and `ENV` to full on "bring-your-own-Dockerfile". We'll cover each of these in this guide, along with tips and tricks for building Images effectively when using each tool. The typical flow for defining an image in Modal is [method chaining](https://jugad2.blogspot.com/2016/02/examples-of-method-chaining-in-python.html) starting from a base image, like this: ```python import modal image = ( modal.Image.debian_slim(python_version="3.10") .apt_install("git") .pip_install("torch==2.6.0") .env({"HALT_AND_CATCH_FIRE": "0"}) .run_commands("git clone https://github.com/modal-labs/agi && echo 'ready to go!'") ) ``` In addition to being Pythonic and clean, this also matches the onion-like [layerwise build process](https://docs.docker.com/build/guide/layers/) of container images. ## Adding Python packages The simplest and most common container modification is to add some third party Python package, like [`pandas`](https://pandas.pydata.org/). You can add Python packages to the environment by passing all the packages you need to the [`Image.uv_pip_install`](https://modal.com/docs/reference/modal.Image#uv_pip_install) method. 
The `Image.uv_pip_install` method takes care of some nuances that are important for using `uv` in containerized workflows, like generating bytecode files during the build phase so that cold starts are faster. You can include [typical Python dependency version specifiers](https://peps.python.org/pep-0508/), like `"torch <= 2.0"`, in the arguments. But we recommend pinning dependencies tightly, like `"torch == 1.9.1"`, to improve the reproducibility and robustness of your builds. ```python import modal datascience_image = ( modal.Image.debian_slim(python_version="3.10") .uv_pip_install("pandas==2.2.0", "numpy") ) @app.function(image=datascience_image) def my_function(): import pandas as pd import numpy as np df = pd.DataFrame() ... ``` If you run into any issues with [`Image.uv_pip_install`](https://modal.com/docs/reference/modal.Image#uv_pip_install), then you can fallback to [`Image.pip_install`](https://modal.com/docs/reference/modal.Image#pip_install) which uses standard `pip`: ```python import modal datascience_image = ( modal.Image.debian_slim(python_version="3.10") .pip_install("pandas==2.2.0", "numpy") ) ``` Note that because you can define a different environment for each and every Modal Function if you so choose, you don't need to worry about virtual environment management. Containers make for much better separation of concerns! If you want to run a specific version of Python remotely rather than just matching the one you're running locally, provide the `python_version` as a string when constructing the base image, like we did above. ## Add local files with `add_local_dir` and `add_local_file` If you want to forward files from your local system, you can do that using the `image.add_local_dir` and `image.add_local_file` image builder methods. ```python image = modal.Image.debian_slim().add_local_dir("/user/erikbern/.aws", remote_path="/root/.aws") ``` By default, these files are added to your container as it starts up rather than introducing a new image layer. This means that the redeployment after making changes is really quick, but also means you can't run additional build steps after. You can specify a `copy=True` argument to the `add_local_` methods to instead force the files to be included in a built image. ### Adding local Python modules There is a convenience method for the special case of adding local Python modules to the container: [`Image.add_local_python_source`](https://modal.com/docs/reference/modal.Image#add_local_python_source) The difference from `add_local_dir` is that `add_local_python_source` takes module names as arguments instead of a file system path and looks up the local package's or module's location via Python's importing mechanism. The files are then added to directories that make them importable in containers in the same way as they are locally. This is mostly intended for pure Python auxiliary modules that are part of your project and that your code imports, whereas third party packages should be installed via [`Image.pip_install()`](https://modal.com/docs/reference/modal.Image#pip_install) or similar. ```python import modal app = modal.App() image_with_module = modal.Image.debian_slim().add_local_python_source("my_local_module") @app.function(image=image_with_module) def f(): import my_local_module # this will now work in containers my_local_module.do_stuff() ``` ### What if I have different Python packages locally and remotely? You might want to use packages inside your Modal code that you don't have on your local computer. 
In the example above, we build a container that uses `pandas`. But if we don't have `pandas` locally, on the computer launching the Modal job, we can't put `import pandas` at the top of the script, since it would cause an `ImportError`.

The easiest solution to this is to put `import pandas` in the function body instead, as you can see above. This means that `pandas` is only imported when running inside the remote Modal container, which has `pandas` installed.

Be careful about what you return from Modal Functions that have different packages installed than the ones you have locally! Modal Functions return Python objects, like `pandas.DataFrame`s, and if your local machine doesn't have `pandas` installed, it won't be able to handle a `pandas` object (the error message you see will mention [serialization](https://hazelcast.com/glossary/serialization/)/[deserialization](https://hazelcast.com/glossary/deserialization/)).

If you have a lot of functions and a lot of Python packages, you might want to keep the imports in the global scope so that every function can use the same imports. In that case, you can use the [`imports()`](https://modal.com/docs/reference/modal.Image#imports) context manager:

```python
import modal

pandas_image = modal.Image.debian_slim().pip_install("pandas", "numpy")

with pandas_image.imports():
    import pandas as pd
    import numpy as np

@app.function(image=pandas_image)
def my_function():
    df = pd.DataFrame()
```

Because these imports happen before a new container processes its first input, you can combine this context manager with [memory snapshots](https://modal.com/docs/guide/memory-snapshot) to improve [cold start performance](https://modal.com/docs/guide/cold-start#share-initialization-work-across-cold-starts-with-memory-snapshots) for Functions that frequently scale from zero.

## Run shell commands with `.run_commands`

You can also supply shell commands that should be executed when building the container image. You might use this to preload custom assets, like model parameters, so that they don't need to be retrieved when Functions start up:

```python
import modal

image_with_model = (
    modal.Image.debian_slim().apt_install("curl").run_commands(
        "curl -O https://raw.githubusercontent.com/opencv/opencv/master/data/haarcascades/haarcascade_frontalcatface.xml",
    )
)

@app.function(image=image_with_model)
def find_cats():
    content = open("/haarcascade_frontalcatface.xml").read()
    ...
```

## Run a Python function during your build with `.run_function`

Instead of using shell commands, you can also run a Python function as an image build step using the [`Image.run_function`](https://modal.com/docs/reference/modal.Image#run_function) method. For example, you can use this to download model parameters from Hugging Face into your Image:

```python
import os

import modal

def download_models() -> None:
    import diffusers

    model_name = "segmind/small-sd"
    pipe = diffusers.StableDiffusionPipeline.from_pretrained(
        model_name, use_auth_token=os.environ["HF_TOKEN"]
    )
    pipe.save_pretrained("/model")

image = (
    modal.Image.debian_slim()
    .pip_install("diffusers[torch]", "transformers", "ftfy", "accelerate")
    .run_function(download_models, secrets=[modal.Secret.from_name("huggingface-secret")])
)
```

Any kwargs accepted by [`@app.function`](https://modal.com/docs/reference/modal.App#function) ([`Volume`s](https://modal.com/docs/guide/volumes), and specifications of resources like [GPUs](https://modal.com/docs/guide/gpu)) can be supplied here.
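For instance, here is a hedged sketch of supplying such kwargs to `.run_function`, attaching a GPU and a Volume to the build step above (the volume name `model-build-cache`, the mount path, and the GPU choice are illustrative, not from the original docs):

```python
import modal

# Illustrative only: a cache Volume and a GPU attached to the build-step function above.
cache_volume = modal.Volume.from_name("model-build-cache", create_if_missing=True)

image = (
    modal.Image.debian_slim()
    .pip_install("diffusers[torch]", "transformers", "ftfy", "accelerate")
    .run_function(
        download_models,  # the build-step function defined in the previous example
        gpu="A10G",  # run this build step on a GPU instance
        volumes={"/cache": cache_volume},
        secrets=[modal.Secret.from_name("huggingface-secret")],
    )
)
```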
Essentially, this is equivalent to running a Modal Function and snapshotting the resulting filesystem as an image.

Whenever you change other features of your image, like the base image or the version of a Python package, the image will automatically be rebuilt the next time it is used. This is a bit more complicated when changing the contents of functions. See the [reference documentation](https://modal.com/docs/reference/modal.Image#run_function) for details.

## Attach GPUs during setup

If a step in the setup of your container image should be run on an instance with a GPU (e.g., so that a package can query the GPU to set compilation flags), pass a desired GPU type when defining that step:

```python
import modal

image = (
    modal.Image.debian_slim()
    .pip_install("bitsandbytes", gpu="H100")
)
```

## Use `mamba` instead of `pip` with `micromamba_install`

`pip` installs Python packages, but some Python workloads require the coordinated installation of system packages as well. The `mamba` package manager can install both. Modal provides a pre-built [Micromamba](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html) base image that makes it easy to work with `micromamba`:

```python
import modal

app = modal.App("bayes-pgm")

numpyro_pymc_image = (
    modal.Image.micromamba()
    .micromamba_install("pymc==5.10.4", "numpyro==0.13.2", channels=["conda-forge"])
)

@app.function(image=numpyro_pymc_image)
def sample():
    import pymc as pm
    import numpyro as np

    print(f"Running on PyMC v{pm.__version__} with JAX/numpyro v{np.__version__} backend")
    ...
```

## Use an existing container image with `.from_registry`

You don't always need to start from scratch! Public registries like [Docker Hub](https://hub.docker.com/) have many pre-built container images for common software packages.

You can use any public image in your function using [`Image.from_registry`](https://modal.com/docs/reference/modal.Image#from_registry), so long as:

- Python 3.9 or later is installed on the `$PATH` as `python`
- `pip` is installed correctly
- The image is built for the [`linux/amd64` platform](https://unix.stackexchange.com/questions/53415/why-are-64-bit-distros-often-called-amd64)
- The image has a [valid `ENTRYPOINT`](#entrypoint)

```python
import modal

sklearn_image = modal.Image.from_registry("huanjason/scikit-learn")

@app.function(image=sklearn_image)
def fit_knn():
    from sklearn.neighbors import KNeighborsClassifier
    ...
```

If an existing image does not have either `python` or `pip` set up properly, you can still use it. Just provide a version number as the `add_python` argument to install a reproducible [standalone build](https://github.com/indygreg/python-build-standalone) of Python:

```python
import modal

image1 = modal.Image.from_registry("ubuntu:22.04", add_python="3.11")
image2 = modal.Image.from_registry("gisops/valhalla:latest", add_python="3.11")
```

The `from_registry` method can load images from all public registries, such as [Nvidia's `nvcr.io`](https://catalog.ngc.nvidia.com/containers), [AWS ECR](https://aws.amazon.com/ecr/), and [GitHub's `ghcr.io`](https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry).

We also support access to [private AWS ECR and GCP Artifact Registry images](https://modal.com/docs/guide/private-registries).

## Bring your own image definition with `.from_dockerfile`

Sometimes, you might already have a container image defined in a Dockerfile.
You can define an Image with a Dockerfile using [`Image.from_dockerfile`](https://modal.com/docs/reference/modal.Image#from_dockerfile). It takes a path to an existing Dockerfile. For instance, we might write a Dockerfile that adds scikit-learn to the official Python image: ``` FROM python:3.9 RUN pip install sklearn ``` and then define a Modal Image with it: ```python import modal dockerfile_image = modal.Image.from_dockerfile("Dockerfile") @app.function(image=dockerfile_image) def fit(): import sklearn ... ``` Note that you can still do method chaining to extend this image! ### Dockerfile command compatibility Since Modal doesn't use Docker to build containers, we have our own implementation of the [Dockerfile specification](https://docs.docker.com/engine/reference/builder/). Most Dockerfiles should work out of the box, but there are some differences to be aware of. First, a few minor Dockerfile commands and flags have not been implemented yet. These include `ONBUILD`, `STOPSIGNAL`, and `VOLUME`. Please reach out to us if your use case requires any of these. Next, there are some command-specific things that may be useful when porting a Dockerfile to Modal. #### `ENTRYPOINT` While the [`ENTRYPOINT`](https://docs.docker.com/engine/reference/builder/#entrypoint) command is supported, there is an additional constraint to the entrypoint script provided: when used with a Modal Function, it must also `exec` the arguments passed to it at some point. This is so the Modal Function runtime's Python entrypoint can run after your own. Most entrypoint scripts in Docker containers are wrappers over other scripts, so this is likely already the case. If you wish to write your own entrypoint script, you can use the following as a template: ```bash #!/usr/bin/env bash # Your custom startup commands here. exec "$@" # Runs the command passed to the entrypoint script. ``` If the above file is saved as `/usr/bin/my_entrypoint.sh` in your container, then you can register it as an entrypoint with `ENTRYPOINT ["/usr/bin/my_entrypoint.sh"]` in your Dockerfile, or with [`entrypoint`](https://modal.com/docs/reference/modal.Image#entrypoint) as an Image build step. ```python import modal image = ( modal.Image.debian_slim() .pip_install("foo") .entrypoint(["/usr/bin/my_entrypoint.sh"]) ) ``` #### `ENV` We currently don't support default values in [interpolations](https://docs.docker.com/compose/compose-file/12-interpolation/), such as `${VAR:-default}` ## Image caching and rebuilds Modal uses the definition of an Image to determine whether it needs to be rebuilt. If the definition hasn't changed since the last time you ran or deployed your App, the previous version will be pulled from the cache. Images are cached per layer (i.e., per `Image` method call), and breaking the cache on a single layer will cause cascading rebuilds for all subsequent layers. You can shorten iteration cycles by defining frequently-changing layers last so that the cached version of all other layers can be used. In some cases, you may want to force an Image to rebuild, even if the definition hasn't changed. You can do this by adding the `force_build=True` argument to any of the Image building methods. ```python import modal image = ( modal.Image.debian_slim() .apt_install("git") .pip_install("slack-sdk", force_build=True) .run_commands("echo hi") ) ``` As in other cases where a layer's definition changes, both the `pip_install` and `run_commands` layers will rebuild, but the `apt_install` will not. 
Remember to remove `force_build=True` after you've rebuilt the Image, or it will rebuild every time you run your code. Alternatively, you can set the `MODAL_FORCE_BUILD` environment variable (e.g. `MODAL_FORCE_BUILD=1 modal run ...`) to rebuild all images attached to your App. But note that when you rebuild a base layer, the cache will be invalidated for _all_ Images that depend on it, and they will rebuild the next time you run or deploy any App that uses that base. If you're debugging an issue with your Image, a better option might be using `MODAL_IGNORE_CACHE=1`. This will rebuild the Image from the top without breaking the Image cache or affecting subsequent builds. ## Image builder updates Because changes to base images will cause cascading rebuilds, Modal is conservative about updating the base definitions that we provide. But many things are baked into these definitions, like the specific versions of the Image OS, the included Python, and the Modal client dependencies. We provide a separate mechanism for keeping base images up-to-date without causing unpredictable rebuilds: the "Image Builder Version". This is a workspace level-configuration that will be used for every Image built in your workspace. We release a new Image Builder Version every few months but allow you to update your workspace's configuration when convenient. After updating, your next deployment will take longer, because your Images will rebuild. You may also encounter problems, especially if your Image definition does not pin the version of the third-party libraries that it installs (as your new Image will get the latest version of these libraries, which may contain breaking changes). You can set the Image Builder Version for your workspace by going to your [workspace settings](https://modal.com/settings/image-config). This page also documents the important updates in each version. #### Private registries # Private registries Modal provides the [`Image.from_registry`](https://modal.com/docs/guide/images#use-an-existing-container-image-with-from_registry) function, which can pull public images available from registries such as Docker Hub and GitHub Container Registry, as well as private images from registries such as [AWS Elastic Container Registry (ECR)](https://aws.amazon.com/ecr/), [GCP Artifact Registry](https://cloud.google.com/artifact-registry), and Docker Hub. ## Docker Hub (Private) To pull container images from private Docker Hub repositories, [create an access token](https://docs.docker.com/security/for-developers/access-tokens/) with "Read-Only" permissions and use this token value and your Docker Hub username to create a Modal [Secret](https://modal.com/docs/guide/secrets). ``` REGISTRY_USERNAME=my-dockerhub-username REGISTRY_PASSWORD=dckr_pat_TS012345aaa67890bbbb1234ccc ``` Use this Secret with the [`modal.Image.from_registry`](https://modal.com/docs/reference/modal.Image#from_registry) method. ## Elastic Container Registry (ECR) You can pull images from your AWS ECR account by specifying the full image URI as follows: ```python import modal aws_secret = modal.Secret.from_name("my-aws-secret") image = ( modal.Image.from_aws_ecr( "000000000000.dkr.ecr.us-east-1.amazonaws.com/my-private-registry:latest", secret=aws_secret, ) .pip_install("torch", "huggingface") ) app = modal.App(image=image) ``` As shown above, you also need to use a [Modal Secret](https://modal.com/docs/guide/secrets) containing the environment variables `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_REGION`. 
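For quick local experimentation, such a Secret can also be assembled from environment variables in code (a hedged sketch; in most cases you would instead create the Secret in the Modal dashboard using the AWS integration option described below):

```python
import os

import modal

# Sketch only: builds the Secret from credentials already present in your local environment.
aws_secret = modal.Secret.from_dict(
    {
        "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
        "AWS_REGION": "us-east-1",  # assumption: the region hosting your ECR registry
    }
)
```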
The AWS IAM user account associated with those keys must have access to the private registry you want to pull from. The user needs to have the following read-only policies:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["ecr:GetAuthorizationToken"],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetRepositoryPolicy",
        "ecr:DescribeRepositories",
        "ecr:ListImages",
        "ecr:DescribeImages",
        "ecr:BatchGetImage",
        "ecr:GetLifecyclePolicy",
        "ecr:GetLifecyclePolicyPreview",
        "ecr:ListTagsForResource",
        "ecr:DescribeImageScanFindings"
      ],
      "Resource": ""
    }
  ]
}
```

You can use the IAM configuration above as a template for creating an IAM user. You can then [generate an access key](https://aws.amazon.com/premiumsupport/knowledge-center/create-access-key/) and create a Modal Secret using the AWS integration option.

Modal will use your access keys to generate an ephemeral ECR token. That token is only used to pull image layers at the time a new image is built. We don't store this token but will cache the image once it has been pulled.

Images on ECR must be private and follow [image configuration requirements](https://modal.com/docs/reference/modal.Image#from_aws_ecr).

## Google Artifact Registry and Google Container Registry

For further detail on how to pull images from Google's image registries, see [`modal.Image.from_gcp_artifact_registry`](https://modal.com/docs/reference/modal.Image#from_gcp_artifact_registry).

#### Fast pull from registry

# Fast pull from registry

The performance of pulling public and private images from registries into Modal can be significantly improved by adopting the [eStargz](https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md) compression format. By applying eStargz compression during your image build and push, Modal will be much more efficient at pulling down your image from the registry.

## How to use estargz

If you have a [BuildKit](https://docs.docker.com/build/buildkit/) version greater than `0.10.0`, adopting `estargz` is as simple as adding some flags to your `docker buildx build` command:

- `type=registry` instructs BuildKit to push the image after building. If you do not push the image immediately after the build and instead attempt to push it later with `docker push`, the image will be converted to a standard gzip image.
- `compression=estargz` specifies that we are using the [eStargz](https://github.com/containerd/stargz-snapshotter/blob/main/docs/estargz.md) compression format.
- `oci-mediatypes=true` specifies that we are using the OCI media types, which is required for eStargz.
- `force-compression=true` will recompress the entire image and convert the base image to eStargz if it is not already.

```bash
docker buildx build --tag "//:" \
  --output type=registry,compression=estargz,force-compression=true,oci-mediatypes=true \
  .
```

Then reference the container image as normal in your Modal code.

```python notest
app = modal.App(
    "example-estargz-pull",
    image=modal.Image.from_registry(
        "public.ecr.aws/modal/estargz-example-images:text-generation-v1-esgz"
    )
)
```

At build time you should see the eStargz-enabled puller activate:

```
Building image im-TinABCTIf12345ydEwTXYZ

=> Step 0: FROM public.ecr.aws/modal/estargz-example-images:text-generation-v1-esgz
Using estargz to speed up image pull (index loaded in 1.86s)...
Progress: 10% complete... (1.11s elapsed)
Progress: 20% complete... (3.10s elapsed)
Progress: 30% complete...
(4.18s elapsed)
Progress: 40% complete... (4.76s elapsed)
Progress: 50% complete... (5.51s elapsed)
Progress: 62% complete... (6.17s elapsed)
Progress: 74% complete... (6.99s elapsed)
Progress: 81% complete... (7.23s elapsed)
Progress: 99% complete... (8.90s elapsed)
Progress: 100% complete... (8.90s elapsed)
Copying image...
Copied image in 5.81s
```

## Supported registries

Currently, Modal supports fast eStargz image pulls from the following registries:

- AWS Elastic Container Registry (ECR)
- Docker Hub (docker.io)
- Google Artifact Registry (gcr.io, pkg.dev)

We are working on adding support for GitHub Container Registry (ghcr.io).

### GPUs and other resources

#### GPU acceleration

# GPU acceleration

Modal makes it easy to run any code on GPUs.

## Quickstart

Here's a simple example of a function running on an A100 in Modal:

```python
import modal

app = modal.App()
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="A100", image=image)
def run():
    import torch
    print(torch.cuda.is_available())
```

This installs PyTorch on top of a base image and is able to use GPUs with PyTorch.

## Specifying GPU type

You can pick a specific GPU type for your function via the `gpu` argument. Modal supports the following values for this parameter:

- `T4`
- `L4`
- `A10G`
- `A100-40GB`
- `A100-80GB`
- `L40S`
- `H100`
- `H200`
- `B200`

For instance, to use an H100, you can use `@app.function(gpu="H100")`.

Refer to our [pricing page](https://modal.com/pricing) for the latest pricing on each GPU type.

## Specifying GPU count

You can specify more than one GPU per container by appending `:n` to the GPU argument. For instance, to run a function with 8 H100s:

```python
@app.function(gpu="H100:8")
def run_llama_405b_fp8():
    ...
```

Currently B200, H200, H100, A100, L4, T4 and L40S instances support up to 8 GPUs (up to 640 GB GPU RAM), and A10G instances support up to 4 GPUs (up to 96 GB GPU RAM). Note that requesting more than 2 GPUs per container will usually result in larger wait times. These GPUs are always attached to the same physical machine.

## Picking a GPU

For running, rather than training, neural networks, we recommend starting off with the [L40S](https://resources.nvidia.com/en-us-l40s/l40s-datasheet-28413), which offers an excellent trade-off of cost and performance and 48 GB of GPU RAM for storing model weights.

For more on how to pick a GPU for use with neural networks like LLaMA or Stable Diffusion, and for tips on how to make that GPU go brrr, check out [Tim Dettmers' blog post](https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/) or the [Full Stack Deep Learning page on Cloud GPUs](https://fullstackdeeplearning.com/cloud-gpus/).

## GPU fallbacks

Modal allows specifying a list of possible GPU types, suitable for functions that are compatible with multiple options. Modal respects the ordering of this list and will try to allocate the most preferred GPU type before falling back to less preferred ones.

```python
@app.function(gpu=["H100", "A100-40GB:2"])
def run_on_80gb():
    ...
```

See [this example](https://modal.com/docs/examples/gpu_fallbacks) for more detail.

## H100 GPUs

Modal's fastest GPUs are the [H100s](https://www.nvidia.com/en-us/data-center/h100/), NVIDIA's flagship data center chip for the Hopper/Lovelace [architecture](https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor-architecture).

To request an H100, set the `gpu` argument to `"H100"`:

```python
@app.function(gpu="H100")
def run_text_to_video():
    ...
```

Check out [this example](https://modal.com/docs/examples/flux) to see how you can generate images from the Flux.schnell model in under a second using an H100.

Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are in your computations. For example, running language models with small batch sizes (e.g. one prompt at a time) results in a [bottleneck on memory, not arithmetic](https://kipp.ly/transformer-inference-arithmetic/). Since arithmetic throughput has risen faster than memory throughput in recent hardware generations, speedups for memory-bound GPU jobs are not as extreme and may not be worth the extra cost.

**H200 GPUs**

Modal may automatically upgrade an H100 request to an [H200](https://www.nvidia.com/en-us/data-center/h200/), NVIDIA's evolution of the H100 chip for the Hopper/Lovelace [architecture](https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor-architecture). This automatic upgrade _does not_ change the cost of the GPU.

H200s are software compatible with H100s, so your code always works for both, but an upgrade to an H200 brings higher memory bandwidth! NVIDIA H200's HBM3e memory bandwidth of 4.8 TB/s is 1.4x faster than NVIDIA H100 with HBM3.

In cases where an automatic upgrade to H200 would not be desired (e.g., benchmarking), you can pass `gpu="H100!"` to avoid it.

## A100 GPUs

[A100s](https://www.nvidia.com/en-us/data-center/a100/) are the previous generation of top-of-the-line data center chip from NVIDIA, based on the Ampere [architecture](https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor-architecture). Modal offers two versions of the A100: one with 40 GB of RAM and another with 80 GB of RAM.

To request an A100 with 40 GB of [GPU memory](https://modal.com/gpu-glossary/device-hardware/gpu-ram), use `gpu="A100"`:

```python
@app.function(gpu="A100")
def llama_7b():
    ...
```

To request an 80 GB A100, use the string `A100-80GB`:

```python
@app.function(gpu="A100-80GB")
def llama_70b_fp8():
    ...
```

## Multi-GPU training

Modal currently supports multi-GPU training on a single machine, with multi-node training in closed beta ([contact us](https://modal.com/slack) for access).

Depending on which framework you are using, you may need to use different techniques to train on multiple GPUs. If the framework re-executes the entrypoint of the Python process (like [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/index.html)), you need to set the strategy to either `ddp_spawn` or `ddp_notebook` if you wish to invoke the training directly. Another option is to run the training script as a subprocess instead.

```python
@app.function(gpu="A100:2")
def run():
    import subprocess
    import sys

    subprocess.run(
        ["python", "train.py"],
        stdout=sys.stdout,
        stderr=sys.stderr,
        check=True,
    )
```

## Examples and more resources

For more information about GPUs in general, check out our [GPU Glossary](https://modal.com/gpu-glossary/readme).

Or take a look at some examples of Modal apps using GPUs:

- [Fine-tune a character LoRA for your pet](https://modal.com/docs/examples/dreambooth_app)
- [Fast LLM inference with vLLM](https://modal.com/docs/examples/vllm_inference)
- [Stable Diffusion with a CLI, API, and web UI](https://modal.com/docs/examples/stable_diffusion_cli)
- [Rendering Blender videos](https://modal.com/docs/examples/blender_video)

#### Using CUDA on Modal

# Using CUDA on Modal

Modal makes it easy to accelerate your workloads with datacenter-grade NVIDIA GPUs.
To take advantage of the hardware, you need to use matching software: the CUDA stack. This guide explains the components of that stack and how to install them on Modal. For more on which GPUs are available on Modal and how to choose a GPU for your use case, see [this guide](https://modal.com/docs/guide/gpu). For a deep dive on both the [GPU hardware](https://modal.com/gpu-glossary/device-hardware) and [software](https://modal.com/gpu-glossary/device-software) and for even more detail on [the CUDA stack](https://modal.com/gpu-glossary/host-software/), see our [GPU Glossary](https://modal.com/gpu-glossary/readme). Here's the tl;dr: - The [NVIDIA Accelerated Graphics Driver for Linux-x86_64](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#driver-installation), version 575.57.08, and [CUDA Driver API](https://docs.nvidia.com/cuda/archive/12.9.0/cuda-driver-api/index.html), version 12.8, are already installed. You can call `nvidia-smi` or run compiled CUDA programs from any Modal Function with access to a GPU. - That means you can install many popular libraries like `torch` that bundle their other CUDA dependencies [with a simple `pip_install`](#install-gpu-accelerated-torch-and-transformers-with-pip_install). - For bleeding-edge libraries like `flash-attn`, you may need to install CUDA dependencies manually. To make your life easier, [use an existing image](#for-more-complex-setups-use-an-officially-supported-cuda-image). ## What is CUDA? When someone refers to "installing CUDA" or "using CUDA", they are referring not to a library, but to a [stack](https://modal.com/gpu-glossary/host-software/cuda-software-platform) with multiple layers. Your application code (and its dependencies) can interact with the stack at different levels. ![The CUDA stack](../../assets/docs/cuda-stack-diagram.png) This leads to a lot of confusion. To help clear that up, the following sections explain each component in detail. ### Level 0: Kernel-mode driver components At the lowest level are the [_kernel-mode driver components_](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#nvidia-open-gpu-kernel-modules). The Linux kernel is essentially a single program operating the entire machine and all of its hardware. To add hardware to the machine, this program is extended by loading new modules into it. These components communicate directly with hardware -- in this case the GPU. Because they are kernel modules, these driver components are tightly integrated with the host operating system that runs your containerized Modal Functions and are not something you can inspect or change yourself. ### Level 1: User-mode driver API All action in Linux that doesn't occur in the kernel occurs in [user space](https://en.wikipedia.org/wiki/User_space). To talk to the kernel drivers from our user space programs, we need _user-mode driver components_. Most prominently, that includes: - the [CUDA Driver API](https://modal.com/gpu-glossary/host-software/cuda-driver-api), a [shared object](https://en.wikipedia.org/wiki/Shared_library) called `libcuda.so`. This object exposes functions like [`cuMemAlloc`](https://docs.nvidia.com/cuda/archive/12.8.0/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gb82d2a09844a58dd9e744dc31e8aa467), for allocating GPU memory. - the [NVIDIA management library](https://developer.nvidia.com/management-library-nvml), `libnvidia-ml.so`, and its command line interface [`nvidia-smi`](https://developer.nvidia.com/system-management-interface). 
You can use these tools to check the status of the system's GPU(s). These components are installed on all Modal machines with access to GPUs. Because they are user-level components, you can use them directly: ```python runner:ModalRunner import modal app = modal.App() @app.function(gpu="any") def check_nvidia_smi(): import subprocess output = subprocess.check_output(["nvidia-smi"], text=True) assert "Driver Version:" in output assert "CUDA Version:" in output print(output) return output ``` ### Level 2: CUDA Toolkit Wrapping the CUDA Driver API is the [CUDA Runtime API](https://modal.com/gpu-glossary/host-software/cuda-runtime-api), the `libcudart.so` shared library. This API includes functions like [`cudaLaunchKernel`](https://docs.nvidia.com/cuda/archive/12.8.0/cuda-runtime-api/group__CUDART__HIGHLEVEL.html#group__CUDART__HIGHLEVEL_1g7656391f2e52f569214adbfc19689eb3) and is more commonly used in CUDA programs (see [this HackerNews comment](https://news.ycombinator.com/item?id=20616385) for color commentary on why). This shared library is _not_ installed by default on Modal. The CUDA Runtime API is generally installed as part of the larger [NVIDIA CUDA Toolkit](https://docs.nvidia.com/cuda/index.html), which includes the [NVIDIA CUDA compiler driver](https://modal.com/gpu-glossary/host-software/nvcc) (`nvcc`) and its toolchain and a number of [useful goodies](https://modal.com/gpu-glossary/host-software/cuda-binary-utilities) for writing and debugging CUDA programs (`cuobjdump`, `cudnn`, profilers, etc.). Contemporary GPU-accelerated machine learning workloads like LLM inference frequently make use of many components of the CUDA Toolkit, such as the run-time compilation library [`nvrtc`](https://docs.nvidia.com/cuda/archive/12.8.0/nvrtc/index.html). So why aren't these components installed along with the drivers? A compiled CUDA program can run without the CUDA Runtime API installed on the system, by [statically linking](https://en.wikipedia.org/wiki/Static_library) the CUDA Runtime API into the program binary, though this is fairly uncommon for CUDA-accelerated Python programs. Additionally, older versions of these components are needed for some applications and some application deployments even use several versions at once. Both patterns are compatible with the host machine driver provided on Modal. ## Install GPU-accelerated `torch` and `transformers` with `pip_install` The components of the CUDA Toolkit can be installed via `pip`, via PyPI packages like [`nvidia-cuda-runtime-cu12`](https://pypi.org/project/nvidia-cuda-runtime-cu12/) and [`nvidia-cuda-nvrtc-cu12`](https://pypi.org/project/nvidia-cuda-nvrtc-cu12/). These components are listed as dependencies of some popular GPU-accelerated Python libraries, like `torch`. 
Because Modal already includes the lower parts of the CUDA stack, you can install these libraries with [the `pip_install` method of `modal.Image`](https://modal.com/docs/guide/images#add-python-packages-with-pip_install), just like any other Python library: ```python image = modal.Image.debian_slim().pip_install("torch") @app.function(gpu="any", image=image) def run_torch(): import torch has_cuda = torch.cuda.is_available() print(f"It is {has_cuda} that torch can access CUDA") return has_cuda ``` Many libraries for running open-weights models, like `transformers` and `vllm`, use `torch` under the hood and so can be installed in the same way: ```python image = modal.Image.debian_slim().pip_install("transformers[torch]") image = image.apt_install("ffmpeg") # for audio processing @app.function(gpu="any", image=image) def run_transformers(): from transformers import pipeline transcriber = pipeline(model="openai/whisper-tiny.en", device="cuda") result = transcriber("https://modal-cdn.com/mlk.flac") print(result["text"]) # I have a dream that one day this nation will rise up live out the true meaning of its creed ``` ## For more complex setups, use an officially-supported CUDA image The disadvantage of installing the CUDA stack via `pip` is that many other libraries that depend on its components being installed as normal system packages cannot find them. For these cases, we recommend you use an image that already has the full CUDA stack installed as system packages and all environment variables set correctly, like the [`nvidia/cuda:*-devel-*` images on Docker Hub](https://hub.docker.com/r/nvidia/cuda). [TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/overview.html) is an inference engine that accelerates and optimizes performance for the large language models. It requires the full CUDA toolkit for installation. ```python cuda_version = "12.8.1" # should be no greater than host CUDA version flavor = "devel" # includes full CUDA toolkit operating_sys = "ubuntu24.04" tag = f"{cuda_version}-{flavor}-{operating_sys}" HF_CACHE_PATH = "/cache" image = ( modal.Image.from_registry(f"nvidia/cuda:{tag}", add_python="3.12") .entrypoint([]) # remove verbose logging by base image on entry .apt_install("libopenmpi-dev") # required for tensorrt .pip_install("tensorrt-llm==0.19.0", "pynvml", extra_index_url="https://pypi.nvidia.com") .pip_install("hf-transfer", "huggingface_hub[hf_xet]") .env({"HF_HUB_CACHE": HF_CACHE_PATH, "HF_HUB_ENABLE_HF_TRANSFER": "1", "PMIX_MCA_gds": "hash"}) ) app = modal.App("tensorrt-llm", image=image) hf_cache_volume = modal.Volume.from_name("hf_cache_tensorrt", create_if_missing=True) @app.function(gpu="A10G", volumes={HF_CACHE_PATH: hf_cache_volume}) def run_tiny_model(): from tensorrt_llm import LLM, SamplingParams sampling_params = SamplingParams(temperature=0.8, top_p=0.95) llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0") output = llm.generate("The capital of France is", sampling_params) print(f"Generated text: {output.outputs[0].text}") return output.outputs[0].text ``` Make sure to choose a version of CUDA that is no greater than the version provided by the host machine. Older minor (`12.*`) versions are guaranteed to be compatible with the host machine's driver, but older major (`11.*`, `10.*`, etc.) versions may not be. ## What next? For more on accessing and choosing GPUs on Modal, check out [this guide](https://modal.com/docs/guide/gpu). To dive deep on GPU internals, check out our [GPU Glossary](https://modal.com/gpu-glossary/readme). 
To see these installation patterns in action, check out these examples: - [Fast LLM inference with vLLM](https://modal.com/docs/examples/vllm_inference) - [Finetune a character LoRA for your pet](https://modal.com/docs/examples/diffusers_lora_finetune) - [Optimized Flux inference](https://modal.com/docs/examples/flux) #### Reserving CPU and memory # Reserving CPU and memory Each Modal container has a default reservation of 0.125 CPU cores and 128 MiB of memory. Containers can exceed this minimum if the worker has available CPU or memory. You can also guarantee access to more resources by requesting a higher reservation. ## CPU cores If you have code that must run on a larger number of cores, you can request that using the `cpu` argument. This allows you to specify a floating-point number of CPU cores: ```python import modal app = modal.App() @app.function(cpu=8.0) def my_function(): # code here will have access to at least 8.0 cores ... ``` ## Memory If you have code that needs more guaranteed memory, you can request it using the `memory` argument. This expects an integer number of megabytes: ```python import modal app = modal.App() @app.function(memory=32768) def my_function(): # code here will have access to at least 32 GiB of RAM ... ``` ## How much can I request? For both CPU and memory, a maximum is enforced at function creation time to ensure your application can be scheduled for execution. Requests exceeding the maximum will be rejected with an [`InvalidError`](https://modal.com/docs/reference/modal.exception#modalexceptioninvaliderror). As the platform grows, we plan to support larger CPU and memory reservations. ## Billing For CPU and memory, you'll be charged based on whichever is higher: your reservation or actual usage. Disk requests are billed by increasing the memory request at a 20:1 ratio. For example, requesting 500 GiB of disk will increase the memory request to 25 GiB, if it is not already set higher. ## Resource limits ### CPU limits Modal containers have a default soft CPU limit that is set at 16 physical cores above the CPU request. Given that the default CPU request is 0.125 cores the default soft CPU limit is 16.125 cores. Above this limit the host will begin to throttle the CPU usage of the container. You can alternatively set the CPU limit explicitly. ```python cpu_request = 1.0 cpu_limit = 4.0 @app.function(cpu=(cpu_request, cpu_limit)) def f(): ... ``` ### Memory limits Modal containers can have a hard memory limit which will 'Out of Memory' (OOM) kill containers which attempt to exceed the limit. This functionality is useful when a container has a serious memory leak. You can set the limit and have the container killed to avoid paying for the leaked GBs of memory. ```python mem_request = 1024 mem_limit = 2048 @app.function( memory=(mem_request, mem_limit), ) def f(): ... ``` Specify this limit using the [`memory` parameter](https://modal.com/docs/reference/modal.App#function) on Modal Functions. ### Disk limits Running Modal containers have access to many GBs of SSD disk, but the amount of writes is limited by: 1. The size of the underlying worker's SSD disk capacity 2. A per-container disk quota that is set in the 100s of GBs. Hitting either limit will cause the container's disk writes to be rejected, which typically manifests as an `OSError`. Increased disk sizes can be requested with the [`ephemeral_disk` parameter](https://modal.com/docs/reference/modal.App#function). The maximum disk size is 3.0 TiB (3,145,728 MiB). 
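For example, a Function that needs extra scratch space for large intermediate files might request a bigger disk like this (a minimal sketch; `ephemeral_disk` is specified in MiB, and the 1 TiB figure and function name are illustrative):

```python
import modal

app = modal.App()

@app.function(ephemeral_disk=1_048_576)  # ~1 TiB of scratch disk, in MiB (illustrative)
def shuffle_large_dataset():
    # read, transform, and write large intermediate files on the container's local disk
    ...
```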
Larger disks are intended to be used for [dataset processing](https://modal.com/docs/guide/dataset-ingestion).

### Scaling out

#### Scaling out

# Scaling out

Modal makes it trivially easy to scale compute across thousands of containers. You won't have to worry about your App crashing if it goes viral or need to wait a long time for your batch jobs to complete.

For the most part, scaling out will happen automatically, and you won't need to think about it. But it can be helpful to understand how Modal's autoscaler works and how you can adjust its behavior when you need finer control.

## How does autoscaling work on Modal?

Every Modal Function corresponds to an autoscaling pool of containers. The size of the pool is managed by Modal's autoscaler. The autoscaler will spin up new containers when there is no capacity available for new inputs, and it will spin down containers when resources are idling. By default, Modal Functions will scale to zero when there are no inputs to process.

Autoscaling decisions are made quickly and frequently so that your batch jobs can ramp up fast and your deployed Apps can respond to any sudden changes in traffic.

## Configuring autoscaling behavior

Modal exposes a few settings that allow you to configure the autoscaler's behavior. These settings can be passed to the `@app.function` or `@app.cls` decorators:

- `max_containers`: The upper limit on containers for the specific Function.
- `min_containers`: The minimum number of containers that should be kept warm, even when the Function is inactive.
- `buffer_containers`: The size of the buffer to maintain while the Function is active, so that additional inputs will not need to queue for a new container.
- `scaledown_window`: The maximum duration (in seconds) that individual containers can remain idle when scaling down.

In general, these settings allow you to trade off cost and latency. Maintaining a larger warm pool or idle buffer will increase costs but reduce the chance that inputs will need to wait for a new container to start. Similarly, a longer scaledown window will let containers idle for longer, which might help avoid unnecessary churn for Apps that receive regular but infrequent inputs. Note that containers may not wait for the entire scaledown window before shutting down if the App is substantially overprovisioned.

## Dynamic autoscaler updates

It's also possible to update the autoscaler settings dynamically (i.e., without redeploying the App) using the [`Function.update_autoscaler()`](https://modal.com/docs/reference/modal.Function#update_autoscaler) method:

```python notest
f = modal.Function.from_name("my-app", "f")
f.update_autoscaler(max_containers=100)
```

The autoscaler settings will revert to the configuration in the function decorator the next time you deploy the App. Or they can be overridden by further dynamic updates:

```python notest
f.update_autoscaler(min_containers=2, max_containers=10)
f.update_autoscaler(min_containers=4)  # max_containers=10 will still be in effect
```

A common pattern is to run this method in a [scheduled function](https://modal.com/docs/guide/cron) that adjusts the size of the warm pool (or container buffer) based on the time of day:

```python
@app.function()
def inference_server():
    ...
@app.function(schedule=modal.Cron("0 6 * * *", timezone="America/New_York"))
def increase_warm_pool():
    inference_server.update_autoscaler(min_containers=4)

@app.function(schedule=modal.Cron("0 22 * * *", timezone="America/New_York"))
def decrease_warm_pool():
    inference_server.update_autoscaler(min_containers=0)
```

When you have a [`modal.Cls`](https://modal.com/docs/reference/modal.Cls), `update_autoscaler` is a method on an _instance_ and will control the autoscaling behavior of containers serving the Function with that specific set of parameters:

```python notest
MyClass = modal.Cls.from_name("my-app", "MyClass")
obj = MyClass(model_version="3.5")
obj.update_autoscaler(buffer_containers=2)  # type: ignore
```

Note that it's necessary to disable type checking on this line, because the object will appear as an instance of the class that you defined rather than the Modal wrapper type.

## Parallel execution of inputs

If your code is running the same function repeatedly with different independent inputs (e.g., a grid search), the easiest way to increase performance is to run those function calls in parallel using Modal's [`Function.map()`](https://modal.com/docs/reference/modal.Function#map) method.

Here is an example where we have a function `evaluate_model` that takes a single argument:

```python
import modal

app = modal.App()

@app.function()
def evaluate_model(x):
    ...

@app.local_entrypoint()
def main():
    inputs = list(range(100))
    for result in evaluate_model.map(inputs):  # runs many inputs in parallel
        ...
```

In this example, `evaluate_model` will be called with each of the 100 inputs (the numbers 0 through 99 in this case) roughly in parallel, and the results are returned as an iterable, ordered in the same way as the inputs.

### Exceptions

By default, if any of the function calls raises an exception, the exception will be propagated. To treat exceptions as successful results and aggregate them in the results list, pass in [`return_exceptions=True`](https://modal.com/docs/reference/modal.Function#map).

```python
@app.function()
def my_func(a):
    if a == 2:
        raise Exception("ohno")
    return a ** 2

@app.local_entrypoint()
def main():
    print(list(my_func.map(range(3), return_exceptions=True, wrap_returned_exceptions=False)))
    # [0, 1, Exception('ohno')]
```

Note: prior to version 1.0.5, the returned exceptions inadvertently leaked an internal wrapper type (`modal.exceptions.UserCodeException`). To avoid breaking any user code that was checking exception types, we're taking a gradual approach to fixing this bug. Passing `wrap_returned_exceptions=False` will opt in to the future default behavior and return the underlying exception type without a wrapper.

### Starmap

If your function takes multiple variable arguments, you can either use [`Function.map()`](https://modal.com/docs/reference/modal.Function#map) with one input iterator per argument, or [`Function.starmap()`](https://modal.com/docs/reference/modal.Function#starmap) with a single input iterator containing sequences (like tuples) that can be spread over the arguments. This works similarly to Python's built-in `map` and `itertools.starmap`.

```python
@app.function()
def my_func(a, b):
    return a + b

@app.local_entrypoint()
def main():
    assert list(my_func.starmap([(1, 2), (3, 4)])) == [3, 7]
```

### Gotchas

Note that `.map()` is a method on the modal function object itself, so you don't explicitly _call_ the function.
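For contrast with the incorrect patterns below, the correct call looks like the earlier example:

```python notest
results = list(evaluate_model.map(inputs))  # .map() is called on the Modal Function itself
```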
Incorrect usage: ```python notest results = evaluate_model(inputs).map() ``` Modal's map is also not the same as using Python's builtin `map()`. While the following will technically work, it will execute all inputs in sequence rather than in parallel. Incorrect usage: ```python notest results = map(evaluate_model, inputs) ``` ## Asynchronous usage All Modal APIs are available in both blocking and asynchronous variants. If you are comfortable with asynchronous programming, you can use it to create arbitrary parallel execution patterns, with the added benefit that any Modal functions will be executed remotely. See the [async guide](https://modal.com/docs/guide/async) or the examples for more information about asynchronous usage. ## GPU acceleration Sometimes you can speed up your applications by utilizing GPU acceleration. See the [gpu section](https://modal.com/docs/guide/gpu) for more information. ## Scaling Limits Modal enforces the following limits for every function: - 2,000 pending inputs (inputs that haven't been assigned to a container yet) - 25,000 total inputs (which include both running and pending inputs) For inputs created with `.spawn()` for async jobs, Modal allows up to 1 million pending inputs instead of 2,000. If you try to create more inputs and exceed these limits, you'll receive a `Resource Exhausted` error, and you should retry your request later. If you need higher limits, please reach out! Additionally, each `.map()` invocation can process at most 1000 inputs concurrently. #### Input concurrency # Input concurrency As traffic to your application increases, Modal will automatically scale up the number of containers running your Function:
By default, each container will be assigned one input at a time. Autoscaling across containers allows your Function to process inputs in parallel. This is ideal when the operations performed by your Function are CPU-bound. For some workloads, though, it is inefficient for containers to process inputs one-by-one. Modal supports these workloads with its _input concurrency_ feature, which allows individual containers to process multiple inputs at the same time:
When used effectively, input concurrency can reduce latency and lower costs. ## Use cases Input concurrency can be especially effective for workloads that are primarily I/O-bound, e.g.: - Querying a database - Making external API requests - Making remote calls to other Modal Functions For such workloads, individual containers may be able to concurrently process large numbers of inputs with minimal additional latency. This means that your Modal application will be more efficient overall, as it won't need to scale containers up and down as traffic ebbs and flows. Another use case is to leverage _continuous batching_ on GPU-accelerated containers. Frameworks such as [vLLM](https://modal.com/docs/examples/vllm_inference) can achieve the benefits of batching across multiple inputs even when those inputs do not arrive simultaneously (because new batches are formed for each forward pass of the model). Note that for CPU-bound workloads, input concurrency will likely not be as effective (or will even be counterproductive), and you may want to use Modal's [_dynamic batching_ feature](https://modal.com/docs/guide/dynamic-batching) instead. ## Enabling input concurrency To enable input concurrency, add the `@modal.concurrent` decorator: ```python @app.function() @modal.concurrent(max_inputs=100) def my_function(input: str): ... ``` When using the class pattern, the decorator should be applied at the level of the _class_, not on individual methods: ```python @app.cls() @modal.concurrent(max_inputs=100) class MyCls: @modal.method() def my_method(self, input: str): ... ``` Because all methods on a class will be served by the same containers, a class with input concurrency enabled will concurrently run distinct methods in addition to multiple inputs for the same method. **Note:** The `@modal.concurrent` decorator was added in v0.73.148 of the Modal Python SDK. Input concurrency could previously be enabled by setting the `allow_concurrent_inputs` parameter on the `@app.function` decorator. ## Setting a concurrency target When using the `@modal.concurrent` decorator, you must always configure the maximum number of inputs that each container will concurrently process. If demand exceeds this limit, Modal will automatically scale up more containers. Additional inputs may need to queue up while these additional containers cold start. To help avoid degraded latency during scaleup, the `@modal.concurrent` decorator has a separate `target_inputs` parameter. When set, Modal's autoscaler will aim for this target as it provisions resources. If demand increases faster than new containers can spin up, the active containers will be allowed to burst above the target up to the `max_inputs` limit: ```python @app.function() @modal.concurrent(max_inputs=120, target_inputs=100) # Allow a 20% burst def my_function(input: str): ... ``` It may take some experimentation to find the right settings for these parameters in your particular application. Our suggestion is to set the `target_inputs` based on your desired latency and the `max_inputs` based on resource constraints (i.e., to avoid GPU OOM). You may also consider the relative latency cost of scaling up a new container versus overloading the existing containers. ## Concurrency mechanisms Modal uses different concurrency mechanisms to execute your Function depending on whether it is defined as synchronous or asynchronous. Each mechanism imposes certain requirements on the Function implementation. 
Input concurrency is an advanced feature, and it's important to make sure that your implementation complies with these requirements to avoid unexpected behavior. For synchronous Functions, Modal will execute concurrent inputs on separate threads. _This means that the Function implementation must be thread-safe._ ```python # Each container can execute up to 10 inputs in separate threads @app.function() @modal.concurrent(max_inputs=10) def sleep_sync(): # Function must be thread-safe time.sleep(1) ``` For asynchronous Functions, Modal will execute concurrent inputs using separate `asyncio` tasks on a single thread. This does not require thread safety, but it does mean that the Function needs to participate in collaborative multitasking (i.e., it should not block the event loop). ```python # Each container can execute up to 10 inputs with separate async tasks @app.function() @modal.concurrent(max_inputs=10) async def sleep_async(): # Function must not block the event loop await asyncio.sleep(1) ``` ## Gotchas Input concurrency is a powerful feature, but there are a few caveats that can be useful to be aware of before adopting it. ### Input cancellations Synchronous and asynchronous Functions handle input cancellations differently. Modal will raise a `modal.exception.InputCancellation` exception in synchronous Functions and an `asyncio.CancelledError` in asynchronous Functions. When using input concurrency with a synchronous Function, a single input cancellation will terminate the entire container. If your workflow depends on graceful input cancellations, we recommend using an asynchronous implementation. ### Concurrent logging The separate threads or tasks that are executing the concurrent inputs will write any logs to the same stream. This makes it difficult to associate logs with a specific input, and filtering for a specific function call in Modal's web dashboard will show logs for all inputs running at the same time. To work around this, we recommend including a unique identifier in the messages you log (either your own identifier or the `modal.current_input_id()`) so that you can use the search functionality to surface logs for a specific input: ```python @app.function() @modal.concurrent(max_inputs=10) async def better_concurrent_logging(x: int): logger.info(f"{modal.current_input_id()}: Starting work with {x}") ``` #### Batch processing # Batch Processing Modal is optimized for large-scale batch processing, allowing functions to scale to thousands of parallel containers with zero additional configuration. Function calls can be submitted asynchronously for background execution, eliminating the need to wait for jobs to finish or tune resource allocation. This guide covers Modal's batch processing capabilities, from basic invocation to integration with existing pipelines. ## Background Execution with `.spawn_map` The fastest way to submit multiple jobs for asynchronous processing is by invoking a function with `.spawn_map`. When combined with the [`--detach`](https://modal.com/docs/reference/cli/run) flag, your App continues running until all jobs are completed. Here's an example of submitting 100,000 videos for parallel embedding. 
You can disconnect after submission, and the processing will continue to completion in the background: ```python # Kick off asynchronous jobs with `modal run --detach batch_processing.py` import modal app = modal.App("batch-processing-example") volume = modal.Volume.from_name("video-embeddings", create_if_missing=True) @app.function(volumes={"/data": volume}) def embed_video(video_id: int): # Business logic: # - Load the video from the volume # - Embed the video # - Save the embedding to the volume ... @app.local_entrypoint() def main(): embed_video.spawn_map(range(100_000)) ``` This pattern works best for jobs that store results externally—for example, in a [Modal Volume](https://modal.com/docs/guide/volumes), [Cloud Bucket Mount](https://modal.com/docs/guide/cloud-bucket-mounts), or your own database\*. _\* For database connections, consider using [Modal Proxy](https://modal.com/docs/guide/proxy-ips) to maintain a static IP across thousands of containers._ ## Parallel Processing with `.map` Using `.map` allows you to offload expensive computations to powerful machines while gathering results. This is particularly useful for pipeline steps with bursty resource demands. Modal handles all infrastructure provisioning and de-provisioning automatically. Here's how to implement parallel video similarity queries as a single Modal function call: ```python # Run jobs and collect results with `modal run gather.py` import modal app = modal.App("gather-results-example") @app.function(gpu="L40S") def compute_video_similarity(query: str, video_id: int) -> tuple[int, int]: # Embed video with GPU acceleration & compute similarity with query return video_id, score @app.local_entrypoint() def main(): import itertools queries = itertools.repeat("Modal for batch processing") video_ids = range(100_000) for video_id, score in compute_video_similarity.map(queries, video_ids): # Process results (e.g., extract top 5 most similar videos) pass ``` This example runs `compute_video_similarity` on an autoscaling pool of L40S GPUs, returning scores to a local process for further processing. ## Integration with Existing Systems The recommended way to use Modal Functions within your existing data pipeline is through [deployed function invocation](https://modal.com/docs/guide/trigger-deployed-functions). After deployment, you can call Modal functions from external systems: ```python def external_function(inputs): compute_similarity = modal.Function.from_name( "gather-results-example", "compute_video_similarity" ) for result in compute_similarity.map(inputs): # Process results pass ``` You can invoke Modal Functions from any Python context, gaining access to built-in observability, resource management, and GPU acceleration. #### Job queues # Job processing Modal can be used as a scalable job queue to handle asynchronous tasks submitted from a web app or any other Python application. This allows you to offload up to 1 million long-running or resource-intensive tasks to Modal, while your main application remains responsive. ## Creating jobs with .spawn() The basic pattern for using Modal as a job queue involves three key steps: 1. Defining and deploying the job processing function using `modal deploy`. 2. Submitting a job using [`modal.Function.spawn()`](https://modal.com/docs/reference/modal.Function#spawn) 3. 
Polling for the job's result using [`modal.FunctionCall.get()`](https://modal.com/docs/reference/modal.FunctionCall#get) Here's a simple example that you can run with `modal run my_job_queue.py`: ```python # my_job_queue.py import modal app = modal.App("my-job-queue") @app.function() def process_job(data): # Perform the job processing here return {"result": data} def submit_job(data): # Since the `process_job` function is deployed, need to first look it up process_job = modal.Function.from_name("my-job-queue", "process_job") call = process_job.spawn(data) return call.object_id def get_job_result(call_id): function_call = modal.FunctionCall.from_id(call_id) try: result = function_call.get(timeout=5) except modal.exception.OutputExpiredError: result = {"result": "expired"} except TimeoutError: result = {"result": "pending"} return result @app.local_entrypoint() def main(): data = "my-data" # Submit the job to Modal call_id = submit_job(data) print(get_job_result(call_id)) ``` In this example: - `process_job` is the Modal function that performs the actual job processing. To deploy the `process_job` function on Modal, run `modal deploy my_job_queue.py`. - `submit_job` submits a new job by first looking up the deployed `process_job` function, then calling `.spawn()` with the job data. It returns the unique ID of the spawned function call. - `get_job_result` attempts to retrieve the result of a previously submitted job using [`FunctionCall.from_id()`](https://modal.com/docs/reference/modal.FunctionCall#from_id) and [`FunctionCall.get()`](https://modal.com/docs/reference/modal.FunctionCall#get). [`FunctionCall.get()`](https://modal.com/docs/reference/modal.FunctionCall#get) waits indefinitely by default. It takes an optional timeout argument that specifies the maximum number of seconds to wait, which can be set to 0 to poll for an output immediately. Here, if the job hasn't completed yet, we return a pending response. - The results of a `.spawn()` are accessible via `FunctionCall.get()` for up to 7 days after completion. After this period, we return an expired response. [Document OCR Web App](https://modal.com/docs/examples/doc_ocr_webapp) is an example that uses this pattern. ## Integration with web frameworks You can easily integrate the job queue pattern with web frameworks like FastAPI. Here's an example, assuming that you have already deployed `process_job` on Modal with `modal deploy` as above. This example won't work if you haven't deployed your app yet. ```python # my_job_queue_endpoint.py import fastapi import modal image = modal.Image.debian_slim().pip_install("fastapi[standard]") app = modal.App("fastapi-modal", image=image) web_app = fastapi.FastAPI() @app.function() @modal.asgi_app() def fastapi_app(): return web_app @web_app.post("/submit") async def submit_job_endpoint(data): process_job = modal.Function.from_name("my-job-queue", "process_job") call = process_job.spawn(data) return {"call_id": call.object_id} @web_app.get("/result/{call_id}") async def get_job_result_endpoint(call_id: str): function_call = modal.FunctionCall.from_id(call_id) try: result = function_call.get(timeout=0) except modal.exception.OutputExpiredError: return fastapi.responses.JSONResponse(content="", status_code=404) except TimeoutError: return fastapi.responses.JSONResponse(content="", status_code=202) return result ``` In this example: - The `/submit` endpoint accepts job data, submits a new job using `process_job.spawn()`, and returns the job's ID to the client. 
- The `/result/{call_id}` endpoint allows the client to poll for the job's result using the job ID. If the job hasn't completed yet, it returns a 202 status code to indicate that the job is still being processed. If the job has expired, it returns a 404 status code to indicate that the job is not found. You can try this app by serving it with `modal serve`: ```shell modal serve my_job_queue_endpoint.py ``` Then interact with its endpoints with `curl`: ```shell # Make a POST request to your app endpoint. $ curl -X POST $YOUR_APP_ENDPOINT/submit?data=data {"call_id":"fc-XXX"} # Use the call_id value from above. $ curl -X GET $YOUR_APP_ENDPOINT/result/fc-XXX ``` ## Scaling and reliability Modal automatically scales the job queue based on the workload, spinning up new instances as needed to process jobs concurrently. It also provides built-in reliability features like automatic retries and timeout handling. You can customize the behavior of the job queue by configuring the `@app.function()` decorator with options like [`retries`](https://modal.com/docs/guide/retries#function-retries), [`timeout`](https://modal.com/docs/guide/timeouts#timeouts), and [`max_containers`](https://modal.com/docs/guide/scale#configuring-autoscaling-behavior). #### Dynamic batching (beta) # Dynamic batching (beta) Modal's `@batched` feature allows you to accumulate requests and process them in dynamically-sized batches, rather than one-by-one. Batching increases throughput at a potential cost to latency. Batched requests can share resources and reuse work, reducing the time and cost per request. Batching is particularly useful for GPU-accelerated machine learning workloads, as GPUs are designed to maximize throughput and are frequently bottlenecked on shareable resources, like weights stored in memory. Static batching can lead to unbounded latency, as the function waits for a fixed number of requests to arrive. Modal's dynamic batching waits for the lesser of a fixed time _or_ a fixed number of requests before executing, maximizing the throughput benefit of batching while minimizing the latency penalty. ## Enable dynamic batching with `@batched` To enable dynamic batching, apply the [`@modal.batched` decorator](https://modal.com/docs/reference/modal.batched) to the target Python function. Then, wrap it in `@app.function()` and run it on Modal, and the inputs will be accumulated and processed in batches. Here's what that looks like: ```python import modal app = modal.App() @app.function() @modal.batched(max_batch_size=2, wait_ms=1000) async def batch_add(xs: list[int], ys: list[int]) -> list[int]: return [x + y for x, y in zip(xs, ys)] ``` When you invoke a function decorated with `@batched`, you invoke it asynchronously on individual inputs. Outputs are returned where they were invoked. For instance, the code below invokes the decorated `batch_add` function above three times, but `batch_add` only executes twice: ```python continuation @app.local_entrypoint() async def main(): inputs = [(1, 300), (2, 200), (3, 100)] async for result in batch_add.starmap.aio(inputs): print(f"Sum: {result}") # Sum: 301 # Sum: 202 # Sum: 103 ``` The first time, it is executed with `xs` batched to `[1, 2]` and `ys` batched to `[300, 200]`. After about a one-second delay, it is executed with `xs` batched to `[3]` and `ys` batched to `[100]`. The result is an iterator that yields `301`, `202`, and `103`.
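Batching is not specific to `starmap`: any single-input invocations that arrive close together are accumulated into the same batches. Here is a minimal sketch (reusing the `app` and `batch_add` definitions above; the entrypoint name `gather_main` is illustrative) that submits the same three inputs as separate `.remote.aio` calls, which are likewise grouped into two executions:

```python continuation
import asyncio


@app.local_entrypoint()
async def gather_main():
    # Three single-input calls made concurrently; Modal accumulates them
    # into batches of up to `max_batch_size=2` inputs each.
    results = await asyncio.gather(
        batch_add.remote.aio(1, 300),
        batch_add.remote.aio(2, 200),
        batch_add.remote.aio(3, 100),
    )
    print(results)  # expected: [301, 202, 103]
```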
## Use `@batched` with functions that take and return lists For a Python function to be compatible with `@modal.batched`, it must adhere to the following rules: - **The inputs to the function must be lists.** In the example above, we pass `xs` and `ys`, which are both lists of `int`s. - **The function must return a list.** In the example above, the function returns a list of sums. - **The lengths of all the input lists and the output list must be the same.** In the example above, if `L == len(xs) == len(ys)`, then `L == len(batch_add(xs, ys))`. ## Modal `Cls` methods are compatible with dynamic batching Methods on Modal [`Cls`](https://modal.com/docs/guide/lifecycle-functions)es also support dynamic batching. ```python import modal app = modal.App() @app.cls() class BatchedClass: @modal.batched(max_batch_size=2, wait_ms=1000) async def batch_add(self, xs: list[int], ys: list[int]) -> list[int]: return [x + y for x, y in zip(xs, ys)] ``` One additional rule applies to classes with Batched Methods: - If a class has a Batched Method, it **cannot have other Batched Methods or [Methods](https://modal.com/docs/reference/modal.method#modalmethod)**. ## Configure the wait time and batch size of dynamic batches The `@batched` decorator takes in two required configuration parameters: - `max_batch_size` limits the number of inputs combined into a single batch. - `wait_ms` limits the amount of time the Function waits for more inputs after the first input is received. The first invocation of the Batched Function initiates a new batch, and subsequent calls add requests to this ongoing batch. If `max_batch_size` is reached, the batch immediately executes. If the `max_batch_size` is not met but `wait_ms` has passed since the first request was added to the batch, the unfilled batch is executed. ### Selecting a batch configuration To optimize the batching configurations for your application, consider the following heuristics: - Set `max_batch_size` to the largest value your function can handle, so you can amortize and parallelize as much work as possible. - Set `wait_ms` to the difference between your targeted latency and the execution time. Most applications have a targeted latency, and this allows the latency of any request to stay within that limit. ## Serve web endpoints with dynamic batching Here's a simple example of serving a Function that batches requests dynamically with a [`@modal.fastapi_endpoint`](https://modal.com/docs/guide/webhooks). Run [`modal serve`](https://modal.com/docs/reference/cli/serve), submit requests to the endpoint, and the Function will batch your requests on the fly. ```python import modal app = modal.App(image=modal.Image.debian_slim().pip_install("fastapi")) @app.function() @modal.batched(max_batch_size=2, wait_ms=1000) async def batch_add(xs: list[int], ys: list[int]) -> list[int]: return [x + y for x, y in zip(xs, ys)] @app.function() @modal.fastapi_endpoint(method="POST", docs=True) async def add(body: dict[str, int]) -> dict[str, int]: result = await batch_add.remote.aio(body["x"], body["y"]) return {"result": result} ``` Now, you can submit requests to the web endpoint and process them in batches.
For instance, the three requests in the following example, which might be requests from concurrent clients in a real deployment, will be batched into two executions: ```python notest import asyncio import aiohttp async def send_post_request(session, url, data): async with session.post(url, json=data) as response: return await response.json() async def main(): # Enter the URL of your web endpoint here url = "https://workspace--app-name-endpoint-name.modal.run" async with aiohttp.ClientSession() as session: # Submit three requests asynchronously tasks = [ send_post_request(session, url, {"x": 1, "y": 300}), send_post_request(session, url, {"x": 2, "y": 200}), send_post_request(session, url, {"x": 3, "y": 100}), ] results = await asyncio.gather(*tasks) for result in results: print(f"Sum: {result['result']}") asyncio.run(main()) ``` #### Multi-node clusters (beta) # Multi-node clusters (beta) > 🚄 Multi-node clusters with RDMA are in **private beta.** Please contact us via the [Modal Slack](https://modal.com/slack) or support@modal.com to get access. Modal supports running a training job across several coordinated containers. Each container can saturate the available GPU devices on its host (a.k.a. node) and communicate with peer containers which do the same. By scaling a training job from a single GPU to 16 GPUs, you can achieve a nearly 16x improvement in training time. ### Cluster compute capability Modal H100 clusters provide: - A 50 Gbps [IPv6 private network](https://modal.com/docs/guide/private-networking) for orchestration and dataset downloading. - A 3200 Gbps RDMA scale-out network ([RoCE](https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet)). - Up to 64 H100 SXM devices. - At least 1TB of RAM and 4TB of local NVMe SSD per node. - Deep burn-in testing. - Interoperability with all Modal platform functionality (Volumes, Dicts, Tunnels, etc.). This guide will walk you through how the Modal client library enables multi-node training and integrates with `torchrun`. ### @clustered Unlike standard Modal serverless containers, containers in a multi-node training job must be able to: 1. Perform fast, direct network communication between each other. 2. Be scheduled together, all or nothing, at the same time. The `@clustered` decorator enables this behavior. ```python notest import modal import modal.experimental @app.function( gpu="H100:8", timeout=60 * 60 * 24, retries=modal.Retries(initial_delay=0.0, max_retries=10), ) @modal.experimental.clustered(size=4) def train_model(): cluster_info = modal.experimental.get_cluster_info() container_rank = cluster_info.rank world_size = len(cluster_info.container_ips) main_addr = cluster_info.container_ips[0] is_main = "(main)" if container_rank == 0 else "" print(f"{container_rank=} {is_main} {world_size=} {main_addr=}") ... ``` Applying this decorator under `@app.function` modifies the Function so that remote calls to it are serviced by a multi-node container group. The above configuration creates a group of four containers, each having 8 H100 GPU devices, for a total of 32 devices. ## Scheduling A `modal.experimental.clustered` Function runs on multiple nodes in our cloud, but executes like a normal function call. For example, all nodes are scheduled together ([gang scheduling](https://en.wikipedia.org/wiki/Gang_scheduling)) so that your code runs on all of the requested hardware or not at all. Traditionally this kind of cluster and scheduling management would be handled by SLURM, Kubernetes, or manually.
But with Modal it’s all provided serverlessly with just an application of the decorator! ### Rank & input broadcast ![diagram](https://modal-cdn.com/cdnbot/multinodepmgnla70_4b57a155.webp) You may notice above that a single `.remote` Function call created three input executions but returned only one output. This is how input-output is structured for multi-node training jobs on Modal. The Function call’s arguments are replicated to each container, but only the rank zero container’s output is returned to the caller. A container’s rank is a key concept in multi-node training jobs. Rank zero is the ‘leader’ rank and typically coordinates the job. Rank zero is also known as the “main” container. Rank zero’s output will always be the output of a multi-node training run. ## Networking Function containers cannot normally make direct network connections to other Function containers, but this is a requirement for multi-node training communication. So, along with gang scheduling, the `@clustered` decorator enables Modal’s workspace-private inter-container networking called [i6pn](https://www.notion.so/Multi-node-docs-1281e7f16949806f966adedfe8b2cb74?pvs=21). The [cluster networking guide](https://modal.com/docs/guide/private-networking) goes into more detail on i6pn, but the upshot is that each container in the cluster is made aware of the network address of all the other containers in the cluster, enabling them to communicate with each other quickly via [TCP](https://pytorch.org/docs/stable/elastic/rendezvous.html). ### RDMA (Infiniband) H100 clusters are equipped with Infiniband, providing up to 3,200 Gbps scale-out bandwidth for inter-node communication. RDMA scale-out networking is enabled with the `rdma` parameter of `modal.experimental.clustered`. ```python notest @modal.experimental.clustered(size=2, rdma=True) def train(): ... ``` To run a simple Infiniband RDMA performance test, see the [`modal-examples` repository example](https://github.com/modal-labs/multinode-training-guide/tree/main/benchmark). ## Cluster Info `modal.experimental.get_cluster_info()` exposes the following information about the cluster: - `rank: int` is the container's order within the cluster, starting from `0`, the leader. - `container_ips: list[str]` contains the IPv6 addresses of each container in the cluster, sorted by rank. ## Fault Tolerance For a clustered Function, failures in inputs and containers are handled differently. If an input fails on any container, this failure **is not propagated** to other containers in the cluster. Containers are responsible for detecting and responding to input failures on other containers. Only rank 0’s output matters: if an input fails on the leader container (rank 0), the input is marked as failed, even if the input succeeds on another container. Similarly, if an input succeeds on the leader container but fails on another container, the input will still be marked as successful. If a container in the cluster is preempted, Modal will terminate all remaining containers in the cluster, and retry the input. ### Input Synchronization _**Important:**_ Synchronization is not relevant for single training runs, and applies mostly to inference use-cases. Modal does not synchronize input execution across containers. Containers are responsible for ensuring that they do not process inputs faster than other containers in their cluster.
In particular, it is important that the leader container (rank 0) only starts processing the next input after all other containers have finished processing the current input. ## Examples To get hands-on with multi-node training you can jump into the [`multinode-training-guide` repository](https://github.com/modal-labs/multinode-training-guide) or [`modal-examples` repository](https://github.com/modal-labs/modal-examples/tree/main/12_datasets) and `modal run` something! - [Simple ‘hello world’ 4 X 1 H100 torch cluster example](https://github.com/modal-labs/modal-examples/blob/main/14_clusters/simple_torch_cluster.py) - [Infiniband RDMA performance test](https://github.com/modal-labs/multinode-training-guide/tree/main/benchmark) - [Use 2 x 8 H100s to train a ResNet50 model on the ImageNet dataset](https://github.com/modal-labs/multinode-training-guide/tree/main/resnet50) - [Speedrun GPT-2 training with modded-nanogpt](https://github.com/modal-labs/multinode-training-guide/tree/main/nanoGPT) ### Torchrun Example ```python import modal import modal.experimental image = ( modal.Image.debian_slim(python_version="3.12") .pip_install("torch~=2.5.1", "numpy~=2.2.1") .add_local_dir( "training", remote_path="/root/training" ) ) app = modal.App("example-simple-torch-cluster", image=image) n_nodes = 4 @app.function( gpu=f"H100:8", timeout=60 * 60 * 24, ) @modal.experimental.clustered(size=n_nodes) def launch_torchrun(): # import the 'torchrun' interface directly. from torch.distributed.run import parse_args, run cluster_info = modal.experimental.get_cluster_info() run( parse_args( [ f"--nnodes={n_nodes}", f"--node_rank={cluster_info.rank}", f"--master_addr={cluster_info.container_ips[0]}", f"--nproc-per-node=8", "--master_port=1234", "training/train.py", ] ) ) ``` ### Scheduling and cron jobs # Scheduling remote cron jobs A common requirement is to perform some task at a given time every day or week automatically. Modal facilitates this through function schedules. ## Basic scheduling Let's say we have a Python module `heavy.py` with a function, `perform_heavy_computation()`. ```python # heavy.py def perform_heavy_computation(): ... if __name__ == "__main__": perform_heavy_computation() ``` To schedule this function to run once per day, we create a Modal App and attach our function to it with the `@app.function` decorator and a schedule parameter: ```python # heavy.py import modal app = modal.App() @app.function(schedule=modal.Period(days=1)) def perform_heavy_computation(): ... ``` To activate the schedule, deploy your app, either through the CLI: ```shell modal deploy --name daily_heavy heavy.py ``` Or programmatically: ```python if __name__ == "__main__": app.deploy() ``` Now the function will run every day, at the time of the initial deployment, without any further interaction on your part. When you make changes to your function, just rerun the deploy command to overwrite the old deployment. Note that when you redeploy your function, `modal.Period` resets, and the schedule will run X hours after this most recent deployment. If you want to run your function at a regular schedule not disturbed by deploys, `modal.Cron` (see below) is a better option. ## Monitoring your scheduled runs To see past execution logs for the scheduled function, go to the [Apps](https://modal.com/apps) section on the Modal web site. Schedules currently cannot be paused. Instead the schedule should be removed and the app redeployed. Schedules can be started manually on the app's dashboard page, using the "run now" button. 
## Schedule types There are two kinds of base schedule values - [`modal.Period`](https://modal.com/docs/reference/modal.Period) and [`modal.Cron`](https://modal.com/docs/reference/modal.Cron). [`modal.Period`](https://modal.com/docs/reference/modal.Period) lets you specify an interval between function calls, e.g. `Period(days=1)` or `Period(hours=5)`: ```python # runs once every 5 hours @app.function(schedule=modal.Period(hours=5)) def perform_heavy_computation(): ... ``` [`modal.Cron`](https://modal.com/docs/reference/modal.Cron) gives you finer control using [cron](https://en.wikipedia.org/wiki/Cron) syntax: ```python # runs at 8 am (UTC) every Monday @app.function(schedule=modal.Cron("0 8 * * 1")) def perform_heavy_computation(): ... # runs daily at 6 am (New York time) @app.function(schedule=modal.Cron("0 6 * * *", timezone="America/New_York")) def send_morning_report(): ... ``` For more details, see the API reference for [Period](https://modal.com/docs/reference/modal.Period), [Cron](https://modal.com/docs/reference/modal.Cron) and [Function](https://modal.com/docs/reference/modal.Function) ### Deployment #### Apps, Functions, and entrypoints # Apps, Functions, and entrypoints An `App` is the object that represents an application running on Modal. All functions and classes are associated with an [`App`](https://modal.com/docs/reference/modal.App#modalapp). When you [`run`](https://modal.com/docs/reference/cli/run) or [`deploy`](https://modal.com/docs/reference/cli/deploy) an `App`, it creates an ephemeral or a deployed `App`, respectively. You can view a list of all currently running Apps on the [`apps`](https://modal.com/apps) page. ## Ephemeral Apps An ephemeral App is created when you use the [`modal run`](https://modal.com/docs/reference/cli/run) CLI command, or the [`app.run`](https://modal.com/docs/reference/modal.App#run) method. This creates a temporary App that only exists for the duration of your script. Ephemeral Apps are stopped automatically when the calling program exits, or when the server detects that the client is no longer connected. You can use [`--detach`](https://modal.com/docs/reference/cli/run) in order to keep an ephemeral App running even after the client exits. By using `app.run` you can run your Modal apps from within your Python scripts: ```python def main(): ... with app.run(): some_modal_function.remote() ``` By default, running your app in this way won't propagate Modal logs and progress bar messages. To enable output, use the [`modal.enable_output`](https://modal.com/docs/reference/modal.enable_output) context manager: ```python def main(): ... with modal.enable_output(): with app.run(): some_modal_function.remote() ``` ## Deployed Apps A deployed App is created using the [`modal deploy`](https://modal.com/docs/reference/cli/deploy) CLI command. The App is persisted indefinitely until you delete it via the [web UI](https://modal.com/apps). Functions in a deployed App that have an attached [schedule](https://modal.com/docs/guide/cron) will be run on a schedule. Otherwise, you can invoke them manually using [web endpoints or Python](https://modal.com/docs/guide/trigger-deployed-functions). Deployed Apps are named via the [`App`](https://modal.com/docs/reference/modal.App#modalapp) constructor. Re-deploying an existing `App` (based on the name) will update it in place. ## Entrypoints for ephemeral Apps The code that runs first when you `modal run` an App is called the "entrypoint". 
You can register a local entrypoint using the [`@app.local_entrypoint()`](https://modal.com/docs/reference/modal.App#local_entrypoint) decorator. You can also use a regular Modal function as an entrypoint, in which case only the code in global scope is executed locally. ### Argument parsing If your entrypoint function takes arguments with primitive types, `modal run` automatically parses them as CLI options. For example, the following function can be called with `modal run script.py --foo 1 --bar "hello"`: ```python # script.py @app.local_entrypoint() def main(foo: int, bar: str): some_modal_function.remote(foo, bar) ``` If you wish to use your own argument parsing library, such as `argparse`, you can instead accept a variable-length argument list for your entrypoint or your function. In this case, Modal skips CLI parsing and forwards CLI arguments as a tuple of strings. For example, the following function can be invoked with `modal run my_file.py --foo=42 --bar="baz"`: ```python import argparse @app.function() def train(*arglist): parser = argparse.ArgumentParser() parser.add_argument("--foo", type=int) parser.add_argument("--bar", type=str) args = parser.parse_args(args = arglist) ``` ### Manually specifying an entrypoint If there is only one `local_entrypoint` registered, [`modal run script.py`](https://modal.com/docs/reference/cli/run) will automatically use it. If you have no entrypoint specified, and just one decorated Modal function, that will be used as a remote entrypoint instead. Otherwise, you can direct `modal run` to use a specific entrypoint. For example, if you have a function decorated with [`@app.function()`](https://modal.com/docs/reference/modal.App#function) in your file: ```python # script.py @app.function() def f(): print("Hello world!") @app.function() def g(): print("Goodbye world!") @app.local_entrypoint() def main(): f.remote() ``` Running [`modal run script.py`](https://modal.com/docs/reference/cli/run) will execute the `main` function locally, which would call the `f` function remotely. However you can instead run `modal run script.py::app.f` or `modal run script.py::app.g` to execute `f` or `g` directly. ## Apps were once Stubs The `modal.App` class in the client was previously called `modal.Stub`. The old name was kept as an alias for some time, but from Modal 1.0.0 onwards, using `modal.Stub` will result in an error. #### Managing deployments # Managing deployments Once you've finished using `modal run` or `modal serve` to iterate on your Modal code, it's time to deploy. A Modal deployment creates and then persists an application and its objects, providing the following benefits: - Repeated application function executions will be grouped under the deployment, aiding observability and usage tracking. Programmatically triggering lots of ephemeral App runs can clutter your web and CLI interfaces. - Function calls are much faster because deployed functions are persistent and reused, not created on-demand by calls. Learn how to trigger deployed functions in [Invoking deployed functions](https://modal.com/docs/guide/trigger-deployed-functions). - [Scheduled functions](https://modal.com/docs/guide/cron) will continue scheduling separate from any local iteration you do, and will notify you on failure. - [Web endpoints](https://modal.com/docs/guide/webhooks) keep running when you close your laptop, and their URL address matches the deployment name. 
## Creating deployments Deployments are created using the [`modal deploy` command](https://modal.com/docs/reference/cli/deploy). ``` % modal deploy -m whisper_pod_transcriber.main ✓ Initialized. View app page at https://modal.com/apps/ap-PYc2Tb7JrkskFUI8U5w0KG. ✓ Created objects. ├── 🔨 Created populate_podcast_metadata. ├── 🔨 Mounted /home/ubuntu/whisper_pod_transcriber at /root/whisper_pod_transcriber ├── 🔨 Created fastapi_app => https://modal-labs-whisper-pod-transcriber-fastapi-app.modal.run ├── 🔨 Mounted /home/ubuntu/whisper_pod_transcriber/whisper_frontend/dist at /assets ├── 🔨 Created search_podcast. ├── 🔨 Created refresh_index. ├── 🔨 Created transcribe_segment. ├── 🔨 Created transcribe_episode. └── 🔨 Created fetch_episodes. ✓ App deployed! 🎉 View Deployment: https://modal.com/apps/modal-labs/whisper-pod-transcriber ``` Running this command on an existing deployment will redeploy the App, incrementing its version. For details on how live deployed apps transition between versions, see the [Updating deployments](#updating-deployments) section. Deployments can also be created programmatically using Modal's [Python API](https://modal.com/docs/reference/modal.App#deploy). ## Viewing deployments Deployments can be viewed either on the [apps](https://modal.com/apps) web page or by using the [`modal app list` command](https://modal.com/docs/reference/cli/app#modal-app-list). ## Updating deployments A deployment can deploy a new App or redeploy a new version of an existing deployed App. It's useful to understand how Modal handles the transition between versions when an App is redeployed. In general, Modal aims to support zero-downtime deployments by gradually transitioning traffic to the new version. If the deployment involves building new versions of the Images used by the App, the build process will need to complete successfully. The existing version of the App will continue to handle requests during this time. Errors during the build will abort the deployment with no change to the status of the App. After the build completes, Modal will start to bring up new containers running the latest version of the App. The existing containers will continue handling requests (using the previous version of the App) until the new containers have completed their cold start. Once the new containers are ready, old containers will stop accepting new requests. However, the old containers will continue running any requests they had previously accepted. The old containers will not terminate until they have finished processing all ongoing requests. Any warm pool containers will also be cycled during a deployment, as the previous version's warm pool is now outdated. ## Deployment rollbacks To quickly reset an App back to a previous version, you can perform a deployment _rollback_. Rollbacks can be triggered from either the App dashboard or the CLI. Rollback deployments look like new deployments: they increment the version number and are attributed to the user who triggered the rollback. But the App's functions and metadata will be reset to their previous state independently of your current App codebase. Note that deployment rollbacks are supported only on the Team and Enterprise plans. ## Stopping deployments Deployed apps can be stopped in the web UI by clicking the red "Stop app" button on the App's "Overview" page, or alternatively from the command line using the [`modal app stop` command](https://modal.com/docs/reference/cli/app#modal-app-stop). Stopping an App is a destructive action.
Apps cannot be restarted from this state; a new App will need to be deployed from the same source files. Objects associated with stopped deployments will eventually be garbage collected. #### Invoking deployed functions # Invoking deployed functions Modal lets you take a function created by a [deployment](https://modal.com/docs/guide/managing-deployments) and call it from other contexts. There are two ways of invoking deployed functions. If the invoking client is running Python, then the same [Modal client library](https://pypi.org/project/modal/) used to write Modal code can be used. HTTPS is used if the invoking client is not running Python and therefore cannot import the Modal client library. ## Invoking with Python Some use cases for Python invocation include: - An existing Python web server (e.g. Django, Flask) wants to invoke Modal functions. - You have split your product or system into multiple Modal applications that deploy independently and call each other. ### Function lookup and invocation basics Let's say you have a script `shared_app.py` and this script defines a Modal app with a function that computes the square of a number: ```python import modal app = modal.App("my-shared-app") @app.function() def square(x: int): return x ** 2 ``` You can deploy this app to create a persistent deployment: ``` % modal deploy shared_app.py ✓ Initialized. ✓ Created objects. ├── 🔨 Created square. ├── 🔨 Mounted /Users/erikbern/modal/shared_app.py. ✓ App deployed! 🎉 View Deployment: https://modal.com/apps/erikbern/my-shared-app ``` Let's try to run this function from a different context. For instance, let's fire up the Python interactive interpreter: ```bash % python Python 3.9.5 (default, May 4 2021, 03:29:30) [Clang 12.0.0 (clang-1200.0.32.27)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import modal >>> f = modal.Function.from_name("my-shared-app", "square") >>> f.remote(42) 1764 >>> ``` This works exactly the same as a regular Modal `Function` object. For example, you can `.map()` over functions invoked this way too: ```bash >>> f = modal.Function.from_name("my-shared-app", "square") >>> f.map([1, 2, 3, 4, 5]) [1, 4, 9, 16, 25] ``` #### Authentication The Modal Python SDK will read the token from `~/.modal.toml`, which is typically created using `modal token new`. Another method of providing the credentials is to set the environment variables `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET`. If you want to call a Modal function from a context such as a web server, you can expose these environment variables to the process. #### Lookup of lifecycle functions [Lifecycle functions](https://modal.com/docs/guide/lifecycle-functions) are defined on classes, which you can look up in a different way. Consider this code: ```python import modal app = modal.App("my-shared-app") @app.cls() class MyLifecycleClass: @modal.enter() def enter(self): self.var = "hello world" @modal.method() def foo(self): return self.var ``` Let's say you deploy this app. You can then call the function by doing this: ```bash >>> cls = modal.Cls.from_name("my-shared-app", "MyLifecycleClass") >>> obj = cls() # You can pass any constructor arguments here >>> obj.foo.remote() 'hello world' ``` ### Asynchronous invocation In certain contexts, a Modal client will need to trigger Modal functions without waiting on the result. This is done by spawning functions and receiving a [`FunctionCall`](https://modal.com/docs/reference/modal.FunctionCall) as a handle to the triggered execution.
The following is an example of a Flask web server (running outside Modal) that accepts model training jobs to be executed within Modal. Instead of the HTTP POST request waiting on a training job to complete, which would be infeasible, the relevant Modal function is spawned and the [`FunctionCall`](https://modal.com/docs/reference/modal.FunctionCall) object is stored for later polling of execution status. ```python from uuid import uuid4 import modal from flask import Flask, jsonify, request app = Flask(__name__) pending_jobs = {} ... @app.route("/jobs", methods=["POST"]) def create_job(): predict_fn = modal.Function.from_name("example", "train_model") job_id = str(uuid4()) function_call = predict_fn.spawn( job_id=job_id, params=request.json, ) pending_jobs[job_id] = function_call return { "job_id": job_id, "status": "pending", } ``` ### Importing a Modal function between Modal apps You can also import one function defined in an app from another app: ```python import modal app = modal.App("another-app") square = modal.Function.from_name("my-shared-app", "square") @app.function() def cube(x): return x * square.remote(x) @app.local_entrypoint() def main(): assert cube.remote(42) == 74088 ``` ### Comparison with HTTPS Compared with HTTPS invocation, Python invocation has the following benefits: - Avoids the need to create web endpoint functions. - Avoids handling serialization of request and response data between Modal and your client. - Uses the Modal client library's built-in authentication. - Web endpoints are public to the entire internet, whereas function `lookup` only exposes your code to you (and your org). - You can work with shared Modal functions as if they are normal Python functions, which might be more convenient. ## Invoking with HTTPS Any non-Python application client can interact with deployed Modal applications via [web endpoint functions](https://modal.com/docs/guide/webhooks). Anything able to make HTTPS requests can trigger a Modal web endpoint function. Note that all deployed web endpoint functions have [a stable HTTPS URL](https://modal.com/docs/guide/webhook-urls). Some use cases for HTTPS invocation include: - Calling Modal functions from a web browser client running Javascript - Calling Modal functions from non-Python backend services (Java, Go, Ruby, NodeJS, etc.) - Calling Modal functions using UNIX tools (`curl`, `wget`) However, if the client of your Modal deployment is running Python, it's better to use the [Modal client library](https://pypi.org/project/modal/) to invoke your Modal code. For more detail on setting up functions for invocation over HTTP, see the [web endpoints guide](https://modal.com/docs/guide/webhooks). #### Continuous deployment # Continuous deployment It's a common pattern to auto-deploy your Modal App as part of a CI/CD pipeline. To get you started, below is a guide to doing continuous deployment of a Modal App in GitHub. ## GitHub Actions Here's a sample GitHub Actions workflow that deploys your App on every push to the `main` branch. This requires you to create a [Modal token](https://modal.com/settings/tokens) and add it as a [secret for your Github Actions workflow](https://github.com/Azure/actions-workflow-samples/blob/master/assets/create-secrets-for-GitHub-workflows.md).
After setting up secrets, create a new workflow file in your repository at `.github/workflows/ci-cd.yml` with the following contents: ```yaml name: CI/CD on: push: branches: - main jobs: deploy: name: Deploy runs-on: ubuntu-latest env: MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }} MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }} steps: - name: Checkout Repository uses: actions/checkout@v4 - name: Install Python uses: actions/setup-python@v5 with: python-version: "3.10" - name: Install Modal run: | python -m pip install --upgrade pip pip install modal - name: Deploy job run: | modal deploy -m my_package.my_file ``` Be sure to replace `my_package.my_file` with your actual entrypoint. If you use multiple Modal [Environments](https://modal.com/docs/guide/environments), you can additionally specify the target environment in the YAML using `MODAL_ENVIRONMENT=xyz`. #### Running untrusted code in Functions # Running untrusted code in Functions Modal provides two primitives for running untrusted code: Restricted Functions and [Sandboxes](https://modal.com/docs/guide/sandbox). While both can be used for running untrusted code, they serve different purposes: Sandboxes provide a container-like interface while Restricted Functions provide an interface similar to a traditional Function. Restricted Functions are useful for executing: - Code generated by language models (LLMs) - User-submitted code in interactive environments - Third-party plugins or extensions ## Using `restrict_modal_access` To restrict a Function's access to Modal resources, set `restrict_modal_access=True` on the Function definition: ```python import modal app = modal.App() @app.function(restrict_modal_access=True) def run_untrusted_code(code_input: str): # This function cannot access Modal resources return eval(code_input) ``` When `restrict_modal_access` is enabled: - The Function cannot access Modal resources (Queues, Dicts, etc.) - The Function cannot call other Functions - The Function cannot access Modal's internal APIs ## Comparison with Sandboxes While both `restrict_modal_access` and [Sandboxes](https://modal.com/docs/guide/sandbox) can be used for running untrusted code, they serve different purposes: | Feature | Restricted Function | Sandbox | | --------- | ------------------------------ | ---------------------------------------------- | | State | Stateless | Stateful | | Interface | Function-like | Container-like | | Setup | Simple decorator | Requires explicit creation/termination | | Use case | Quick, isolated code execution | Interactive development, long-running sessions | ## Best Practices When running untrusted code, consider these additional security measures: 1. Use `max_inputs=1` to ensure each container only handles one request. Containers that get reused could cause information leakage between users. ```python @app.function(restrict_modal_access=True, max_inputs=1) def isolated_function(input_data): # Each input gets a fresh container return process(input_data) ``` 2. Set appropriate timeouts to prevent long-running operations: ```python @app.function( restrict_modal_access=True, timeout=30, # 30 second timeout max_inputs=1 ) def time_limited_function(input_data): return process(input_data) ``` 3. 
Consider using `block_network=True` to prevent the container from making outbound network requests: ```python @app.function( restrict_modal_access=True, block_network=True, max_inputs=1 ) def network_isolated_function(input_data): return process(input_data) ``` ## Example: Running LLM-generated Code Below is a complete example of running code generated by a language model: ```python import modal app = modal.App("restricted-access-example") @app.function(restrict_modal_access=True, max_inputs=1, timeout=30, block_network=True) def run_llm_code(generated_code: str): try: # Create a restricted environment execution_scope = {} # Execute the generated code exec(generated_code, execution_scope) # Return the result if it exists return execution_scope.get("result", None) except Exception as e: return f"Error executing code: {str(e)}" @app.local_entrypoint() def main(): # Example LLM-generated code code = """ def calculate_fibonacci(n): if n <= 1: return n return calculate_fibonacci(n-1) + calculate_fibonacci(n-2) result = calculate_fibonacci(10) """ result = run_llm_code.remote(code) print(f"Result: {result}") ``` This example locks down the container to ensure that the code is safe to execute by: - Restricting Modal access - Using a fresh container for each execution - Setting a timeout - Blocking network access - Catching and handling potential errors ## Error Handling When a restricted Function attempts to access Modal resources, it will raise an `AuthError`: ```python @app.function(restrict_modal_access=True) def restricted_function(q: modal.Queue): try: # This will fail because the Function is restricted return q.get() except modal.exception.AuthError as e: return f"Access denied: {e}" ``` The error message will indicate that the operation is not permitted due to restricted Modal access. ### Secrets and environment variables #### Secrets # Secrets Securely provide credentials and other sensitive information to your Modal Functions with Secrets. You can create and edit Secrets via the [dashboard](https://modal.com/secrets), the command line interface ([`modal secret`](https://modal.com/docs/reference/cli/secret)), and programmatically from Python code ([`modal.Secret`](https://modal.com/docs/reference/modal.Secret)). To inject Secrets into the container running your Function, add the `secrets=[...]` argument to your `app.function` or `app.cls` decoration. ## Deploy Secrets from the Modal Dashboard The most common way to create a Modal Secret is to use the [Secrets panel of the Modal dashboard](https://modal.com/secrets), which also shows any existing Secrets. When you create a new Secret, you'll be prompted with a number of templates to help you get started. These templates demonstrate standard formats for credentials for everything from Postgres and MongoDB to Weights & Biases and Hugging Face. ## Use Secrets in your Modal Apps You can then use your Secret by constructing it `from_name` when defining a Modal App and then accessing its contents as environment variables. For example, if you have a Secret called `secret-keys` containing the key `MY_PASSWORD`: ```python @app.function(secrets=[modal.Secret.from_name("secret-keys")]) def some_function(): import os secret_key = os.environ["MY_PASSWORD"] ... 
``` Each Secret can contain multiple keys and values, but you can also inject multiple Secrets, allowing you to separate Secrets into smaller reusable units: ```python @app.function(secrets=[ modal.Secret.from_name("my-secret-name"), modal.Secret.from_name("other-secret"), ]) def other_function(): ... ``` The Secrets are applied in order, so key-values from later `modal.Secret` objects in the list will overwrite earlier key-values in the case of a clash. For example, if both `modal.Secret` objects above contained the key `FOO`, then the value from `"other-secret"` would always be present in `os.environ["FOO"]`. ## Create Secrets programmatically In addition to defining Secrets on the web dashboard, you can programmatically create a Secret directly in your script and send it along to your Function using `Secret.from_dict(...)`. This can be useful if you want to send Secrets from your local development machine to the remote Modal App. ```python import os import modal if modal.is_local(): local_secret = modal.Secret.from_dict({"FOO": os.environ["LOCAL_FOO"]}) else: local_secret = modal.Secret.from_dict({}) @app.function(secrets=[local_secret]) def some_function(): import os print(os.environ["FOO"]) ``` If you have [`python-dotenv`](https://pypi.org/project/python-dotenv/) installed, you can also use `Secret.from_dotenv()` to create a Secret from the variables in a `.env` file: ```python @app.function(secrets=[modal.Secret.from_dotenv()]) def some_other_function(): print(os.environ["USERNAME"]) ``` ## Interact with Secrets from the command line You can create, list, and delete your Modal Secrets with the `modal secret` command line interface. View your Secrets and their timestamps with: ```bash modal secret list ``` Create a new Secret by passing `{KEY}={VALUE}` pairs to `modal secret create`: ```bash modal secret create database-secret PGHOST=uri PGPORT=5432 PGUSER=admin PGPASSWORD=hunter2 ``` or using environment variables (assuming below that the `PGPASSWORD` environment variable is set e.g. by your CI system): ```bash modal secret create database-secret PGHOST=uri PGPORT=5432 PGUSER=admin PGPASSWORD="$PGPASSWORD" ``` Remove Secrets by passing their name to `modal secret delete`: ```bash modal secret delete database-secret ``` #### Environment variables # Environment variables The Modal runtime sets several environment variables during initialization. The keys for these environment variables are reserved and cannot be overridden by your Function or Sandbox configuration. These variables provide information about the container's runtime environment. ## Container runtime environment variables The following variables are present in every Modal container: - **`MODAL_CLOUD_PROVIDER`** — Modal executes containers across a number of cloud providers ([AWS](https://aws.amazon.com/), [GCP](https://cloud.google.com/), [OCI](https://www.oracle.com/cloud/)). This variable specifies which cloud provider the Modal container is running within. - **`MODAL_IMAGE_ID`** — The ID of the [`modal.Image`](https://modal.com/docs/reference/modal.Image) used by the Modal container. - **`MODAL_REGION`** — This will correspond to a geographic area identifier from the cloud provider associated with the Modal container (see above). For AWS, the identifier is a "region". For GCP it is a "zone", and for OCI it is an "availability domain". Example values are `us-east-1` (AWS), `us-central1` (GCP), `us-ashburn-1` (OCI). - **`MODAL_TASK_ID`** — The ID of the container running the Modal Function or Sandbox.
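For example, a Function can read these variables at runtime to report where it is executing. Here is a minimal sketch (the App name `runtime-info-example` and the function are illustrative, not part of Modal's API):

```python
import os

import modal

app = modal.App("runtime-info-example")  # illustrative name


@app.function()
def report_runtime():
    # Each of these variables is set by the Modal runtime in every container
    print("cloud provider:", os.environ["MODAL_CLOUD_PROVIDER"])
    print("region:", os.environ["MODAL_REGION"])
    print("image ID:", os.environ["MODAL_IMAGE_ID"])
    print("task ID:", os.environ["MODAL_TASK_ID"])
```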
## Function runtime environment variables The following variables are present in containers running Modal Functions: - **`MODAL_ENVIRONMENT`** — The name of the [Modal Environment](https://modal.com/docs/guide/environments) the container is running within. - **`MODAL_IS_REMOTE`** - Set to '1' to indicate that Modal Function code is running in a remote container. - **`MODAL_IDENTITY_TOKEN`** — An [OIDC token](https://modal.com/docs/guide/oidc-integration) encoding the identity of the Modal Function. ## Sandbox environment variables The following variables are present within [`modal.Sandbox`](https://modal.com/docs/reference/modal.Sandbox) instances. - **`MODAL_SANDBOX_ID`** — The ID of the Sandbox. ## Container image environment variables The container image layers used by a `modal.Image` may set environment variables. These variables will be present within your container's runtime environment. For example, the [`debian_slim`](https://modal.com/docs/reference/modal.Image#debian_slim) image sets the `GPG_KEY` variable. To override image variables or set new ones, use the [`.env`](https://modal.com/docs/reference/modal.Image#env) method provided by `modal.Image`. ### Web endpoints #### Web endpoints # Web endpoints This guide explains how to set up web endpoints with Modal. All deployed Modal Functions can be [invoked from any other Python application](https://modal.com/docs/guide/trigger-deployed-functions) using the Modal client library. We additionally provide multiple ways to expose your Functions over the web for non-Python clients. You can [turn any Python function into a web endpoint](#simple-endpoints) with a single line of code, you can [serve a full app](#serving-asgi-and-wsgi-apps) using frameworks like FastAPI, Django, or Flask, or you can [serve anything that speaks HTTP and listens on a port](#non-asgi-web-servers). Below we walk through each method, assuming you're familiar with web applications outside of Modal. For a detailed walkthrough of basic web endpoints on Modal aimed at developers new to web applications, see [this tutorial](https://modal.com/docs/examples/basic_web). ## Simple endpoints The easiest way to create a web endpoint from an existing Python function is to use the [`@modal.fastapi_endpoint` decorator](https://modal.com/docs/reference/modal.fastapi_endpoint). ```python image = modal.Image.debian_slim().pip_install("fastapi[standard]") @app.function(image=image) @modal.fastapi_endpoint() def f(): return "Hello world!" ``` This decorator wraps the Modal Function in a [FastAPI application](#how-do-web-endpoints-run-in-the-cloud). _Note: Prior to v0.73.82, this function was named `@modal.web_endpoint`_. ### Developing with `modal serve` You can run this code as an ephemeral app, by running the command ```shell modal serve server_script.py ``` Where `server_script.py` is the file name of your code. This will create an ephemeral app for the duration of your script (until you hit Ctrl-C to stop it). It creates a temporary URL that you can use like any other REST endpoint. This URL is on the public internet. The `modal serve` command will live-update an app when any of its supporting files change. Live updating is particularly useful when working with apps containing web endpoints, as any changes made to web endpoint handlers will show up almost immediately, without requiring a manual restart of the app. 
### Web endpoints

#### Web endpoints

# Web endpoints

This guide explains how to set up web endpoints with Modal.

All deployed Modal Functions can be [invoked from any other Python application](https://modal.com/docs/guide/trigger-deployed-functions) using the Modal client library. We additionally provide multiple ways to expose your Functions over the web for non-Python clients.

You can [turn any Python function into a web endpoint](#simple-endpoints) with a single line of code, you can [serve a full app](#serving-asgi-and-wsgi-apps) using frameworks like FastAPI, Django, or Flask, or you can [serve anything that speaks HTTP and listens on a port](#non-asgi-web-servers).

Below we walk through each method, assuming you're familiar with web applications outside of Modal. For a detailed walkthrough of basic web endpoints on Modal aimed at developers new to web applications, see [this tutorial](https://modal.com/docs/examples/basic_web).

## Simple endpoints

The easiest way to create a web endpoint from an existing Python function is to use the [`@modal.fastapi_endpoint` decorator](https://modal.com/docs/reference/modal.fastapi_endpoint).

```python
image = modal.Image.debian_slim().pip_install("fastapi[standard]")


@app.function(image=image)
@modal.fastapi_endpoint()
def f():
    return "Hello world!"
```

This decorator wraps the Modal Function in a [FastAPI application](#how-do-web-endpoints-run-in-the-cloud).

_Note: Prior to v0.73.82, this function was named `@modal.web_endpoint`_.

### Developing with `modal serve`

You can run this code as an ephemeral app by running the command

```shell
modal serve server_script.py
```

where `server_script.py` is the file name of your code. This will create an ephemeral app for the duration of your script (until you hit Ctrl-C to stop it). It creates a temporary URL that you can use like any other REST endpoint. This URL is on the public internet.

The `modal serve` command will live-update an app when any of its supporting files change. Live updating is particularly useful when working with apps containing web endpoints, as any changes made to web endpoint handlers will show up almost immediately, without requiring a manual restart of the app.

### Deploying with `modal deploy`

You can also deploy your app and create a persistent web endpoint in the cloud by running `modal deploy` on the same file, e.g. `modal deploy server_script.py`.

### Passing arguments to an endpoint

When using `@modal.fastapi_endpoint`, you can add [query parameters](https://fastapi.tiangolo.com/tutorial/query-params/) which will be passed to your Function as arguments. For instance:

```python
image = modal.Image.debian_slim().pip_install("fastapi[standard]")


@app.function(image=image)
@modal.fastapi_endpoint()
def square(x: int):
    return {"square": x**2}
```

If you hit this with a URL-encoded query string with the `x` parameter present, the Function will receive the value as an argument:

```
$ curl https://modal-labs--web-endpoint-square-dev.modal.run?x=42
{"square":1764}
```

If you want to use a `POST` request, you can use the `method` argument to `@modal.fastapi_endpoint` to set the HTTP verb. To accept any valid JSON object, [use `dict` as your type annotation](https://fastapi.tiangolo.com/tutorial/body-nested-models/?h=dict#bodies-of-arbitrary-dicts) and FastAPI will handle the rest.

```python
image = modal.Image.debian_slim().pip_install("fastapi[standard]")


@app.function(image=image)
@modal.fastapi_endpoint(method="POST")
def square(item: dict):
    return {"square": item['x']**2}
```

This now creates an endpoint that takes a JSON body:

```
$ curl -X POST -H 'Content-Type: application/json' --data-binary '{"x": 42}' https://modal-labs--web-endpoint-square-dev.modal.run
{"square":1764}
```

This is often the easiest way to get started, but note that FastAPI recommends that you use [typed Pydantic models](https://fastapi.tiangolo.com/tutorial/body/) in order to get automatic validation and documentation. FastAPI also lets you pass data to web endpoints in other ways, for instance as [form data](https://fastapi.tiangolo.com/tutorial/request-forms/) and [file uploads](https://fastapi.tiangolo.com/tutorial/request-files/).

## How do web endpoints run in the cloud?

Note that web endpoints, like everything else on Modal, only run when they need to. When you hit the web endpoint the first time, it will boot up the container, which might take a few seconds. Modal keeps the container alive for a short period in case there are subsequent requests. If there are a lot of requests, Modal might create more containers running in parallel.

For the shortcut `@modal.fastapi_endpoint` decorator, Modal wraps your function in a [FastAPI](https://fastapi.tiangolo.com/) application. This means that the [Image](https://modal.com/docs/guide/images) your Function uses must have FastAPI installed, and the Functions that you write need to follow its request and response [semantics](https://fastapi.tiangolo.com/tutorial). Web endpoint Functions can use all of FastAPI's powerful features, such as Pydantic models for automatic validation, typed query and path parameters, and response types.

Here's everything together, combining Modal's abilities to run functions in user-defined containers with the expressivity of FastAPI:

```python
import modal
from fastapi.responses import HTMLResponse
from pydantic import BaseModel

image = modal.Image.debian_slim().pip_install("fastapi[standard]", "boto3")
app = modal.App(image=image)


class Item(BaseModel):
    name: str
    qty: int = 42


@app.function()
@modal.fastapi_endpoint(method="POST")
def f(item: Item):
    import boto3

    # do things with boto3...
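    # (Illustrative sketch, not part of the original example: "do things with
    # boto3" might mean persisting the item to a hypothetical S3 bucket, with
    # AWS credentials supplied via a modal.Secret attached to this Function.)
    # s3 = boto3.client("s3")
    # s3.put_object(Bucket="example-bucket", Key=item.name, Body=str(item.qty))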
    return HTMLResponse(f"Hello, {item.name}!")
```

This endpoint definition would be called like so:

```bash
curl -d '{"name": "Erik", "qty": 10}' \
  -H "Content-Type: application/json" \
  -X POST https://ecorp--web-demo-f-dev.modal.run
```

Or in Python with the [`requests`](https://pypi.org/project/requests/) library:

```python
import requests

data = {"name": "Erik", "qty": 10}
requests.post("https://ecorp--web-demo-f-dev.modal.run", json=data, timeout=10.0)
```

## Serving ASGI and WSGI apps

You can also serve any app written in an [ASGI](https://asgi.readthedocs.io/en/latest/) or [WSGI](https://en.wikipedia.org/wiki/Web_Server_Gateway_Interface)-compatible web framework on Modal.

ASGI provides support for async web frameworks. WSGI provides support for synchronous web frameworks.

### ASGI apps - FastAPI, FastHTML, Starlette

For ASGI apps, you can create a function decorated with [`@modal.asgi_app`](https://modal.com/docs/reference/modal.asgi_app) that returns a reference to your web app:

```python
image = modal.Image.debian_slim().pip_install("fastapi[standard]")


@app.function(image=image)
@modal.concurrent(max_inputs=100)
@modal.asgi_app()
def fastapi_app():
    from fastapi import FastAPI, Request

    web_app = FastAPI()

    @web_app.post("/echo")
    async def echo(request: Request):
        body = await request.json()
        return body

    return web_app
```

Now, as before, when you deploy this script as a Modal App, you get a URL for your app that you can hit.

The `@modal.concurrent` decorator enables a single container to process multiple inputs at once, taking advantage of the asynchronous event loops in ASGI applications. See [this guide](https://modal.com/docs/guide/concurrent-inputs) for details.

#### ASGI Lifespan

While we recommend using [`@modal.enter`](https://modal.com/docs/guide/lifecycle-functions#enter) for defining container lifecycle hooks, we also support the [ASGI lifespan protocol](https://asgi.readthedocs.io/en/latest/specs/lifespan.html). Lifespans begin when containers start, typically at the time of the first request.

Here's an example using [FastAPI](https://fastapi.tiangolo.com/advanced/events/#lifespan):

```python
import modal

app = modal.App("fastapi-lifespan-app")

image = modal.Image.debian_slim().pip_install("fastapi[standard]")


@app.function(image=image)
@modal.asgi_app()
def fastapi_app_with_lifespan():
    from fastapi import FastAPI, Request

    def lifespan(wapp: FastAPI):
        print("Starting")
        yield
        print("Shutting down")

    web_app = FastAPI(lifespan=lifespan)

    @web_app.get("/")
    async def hello(request: Request):
        return "hello"

    return web_app
```
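FastAPI's own documentation writes lifespans as async context managers; if you prefer that form, the inner `lifespan` definition in the example above could equivalently be sketched like this (same behavior, just the `contextlib.asynccontextmanager` style):

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(wapp: FastAPI):
    # Runs once when the ASGI app starts up...
    print("Starting")
    yield
    # ...and once when it shuts down.
    print("Shutting down")


web_app = FastAPI(lifespan=lifespan)
```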
### WSGI apps - Django, Flask

You can serve WSGI apps using the [`@modal.wsgi_app`](https://modal.com/docs/reference/modal.wsgi_app) decorator:

```python
image = modal.Image.debian_slim().pip_install("flask")


@app.function(image=image)
@modal.concurrent(max_inputs=100)
@modal.wsgi_app()
def flask_app():
    from flask import Flask, request

    web_app = Flask(__name__)

    @web_app.post("/echo")
    def echo():
        return request.json

    return web_app
```

See [Flask's docs](https://flask.palletsprojects.com/en/2.1.x/deploying/asgi/) for more information on using Flask as a WSGI app.

Because WSGI apps are synchronous, concurrent inputs will be run on separate threads. See [this guide](https://modal.com/docs/guide/concurrent-inputs) for details.

## Non-ASGI web servers

Not all web frameworks offer an ASGI or WSGI interface. For example, [`aiohttp`](https://docs.aiohttp.org/) and [`tornado`](https://www.tornadoweb.org/) use their own asynchronous network binding, while others like [`text-generation-inference`](https://github.com/huggingface/text-generation-inference) actually expose a Rust-based HTTP server running as a subprocess.

For these cases, you can use the [`@modal.web_server`](https://modal.com/docs/reference/modal.web_server) decorator to "expose" a port on the container:

```python
@app.function()
@modal.concurrent(max_inputs=100)
@modal.web_server(8000)
def my_file_server():
    import subprocess

    subprocess.Popen("python -m http.server -d / 8000", shell=True)
```

Just like all web endpoints on Modal, this is only run on-demand. The function is executed on container startup, creating a file server at the root directory. When you hit the web endpoint URL, your request will be routed to the file server listening on port `8000`.

For `@web_server` endpoints, you need to make sure that the application binds to the external network interface, not just localhost. This usually means binding to `0.0.0.0` instead of `127.0.0.1`.

See our examples of how to serve [Streamlit](https://modal.com/docs/examples/serve_streamlit) and [ComfyUI](https://modal.com/docs/examples/comfyapp) on Modal.

## Serve many configurations with parametrized functions

Python functions that launch ASGI/WSGI apps or web servers on Modal cannot take arguments. One simple pattern for allowing client-side configuration of these web endpoints is to use [parametrized functions](https://modal.com/docs/guide/parametrized-functions). Each different choice for the values of the parameters will create a distinct auto-scaling container pool.

```python
@app.cls()
@modal.concurrent(max_inputs=100)
class Server:
    root: str = modal.parameter(default=".")

    @modal.web_server(8000)
    def files(self):
        import subprocess

        subprocess.Popen(f"python -m http.server -d {self.root} 8000", shell=True)
```

The values are provided in URLs as query parameters:

```bash
curl https://ecorp--server-files.modal.run              # use the default value
curl https://ecorp--server-files.modal.run?root=.cache  # use a different value
curl https://ecorp--server-files.modal.run?root=%2F     # don't forget to URL encode!
```

For details, see [this guide to parametrized functions](https://modal.com/docs/guide/parametrized-functions).

## WebSockets

Functions annotated with `@web_server`, `@asgi_app`, or `@wsgi_app` also support the WebSocket protocol. Consult your web framework for appropriate documentation on how to use WebSockets with that library.

WebSockets on Modal maintain a single function call per connection, which can be useful for keeping state around. Most of the time, you will want to set your handler function to [allow concurrent inputs](https://modal.com/docs/guide/concurrent-inputs), which allows multiple simultaneous WebSocket connections to be handled by the same container.

We support the full WebSocket protocol as per [RFC 6455](https://www.rfc-editor.org/rfc/rfc6455), but we do not yet have support for [RFC 8441](https://www.rfc-editor.org/rfc/rfc8441) (WebSockets over HTTP/2) or [RFC 7692](https://datatracker.ietf.org/doc/html/rfc7692) (`permessage-deflate` extension). WebSocket messages can be up to 2 MiB each.
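As a concrete illustration of the paragraphs above, here is a minimal sketch of a WebSocket echo endpoint using FastAPI behind `@modal.asgi_app`. The App name and route are made up, and any ASGI framework with WebSocket support would work similarly:

```python
import modal

image = modal.Image.debian_slim().pip_install("fastapi[standard]")
app = modal.App("websocket-echo-example")  # hypothetical App name


@app.function(image=image)
@modal.concurrent(max_inputs=100)  # let one container hold many open connections
@modal.asgi_app()
def ws_app():
    from fastapi import FastAPI, WebSocket, WebSocketDisconnect

    web_app = FastAPI()

    @web_app.websocket("/ws")
    async def echo(ws: WebSocket):
        await ws.accept()
        try:
            # One Modal function call per connection; loop until the client hangs up.
            while True:
                message = await ws.receive_text()
                await ws.send_text(f"echo: {message}")
        except WebSocketDisconnect:
            pass

    return web_app
```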
## Performance and scaling

If you have no active containers when the web endpoint receives a request, it will experience a "cold start". Consult the guide page on [cold start performance](https://modal.com/docs/guide/cold-start) for more information on when Functions will cold start and advice on how to mitigate the impact.

If your Function uses `@modal.concurrent`, multiple requests to the same endpoint may be handled by the same container, up to the configured concurrency limit. Beyond that limit, additional containers will start up to scale your App horizontally. When you reach the Function's limit on containers, requests will queue for handling.

Each workspace on Modal has a rate limit on total operations. For a new account, this is set to 200 function inputs or web endpoint requests per second, with a burst multiplier of 5 seconds. If you reach the rate limit, excess requests to web endpoints will return a [429 status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429), and you'll need to [get in touch](mailto:support@modal.com) with us about raising the limit.

Web endpoint request bodies can be up to 4 GiB, and their response bodies are unlimited in size.

## Authentication

Modal offers first-class web endpoint protection via [proxy auth tokens](https://modal.com/docs/guide/webhook-proxy-auth). Proxy auth tokens protect web endpoints by requiring a key and token combination to be passed in the `Modal-Key` and `Modal-Secret` headers. Modal works as a proxy, rejecting requests that aren't authorized to access your endpoint.
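To call a protected endpoint from outside Modal, the client sends the token pair in those two headers. A minimal sketch with `requests` follows; the endpoint URL and token values are placeholders, and the proxy auth guide linked above covers how tokens are created:

```python
import requests

headers = {
    "Modal-Key": "<token-id>",         # placeholder proxy auth token ID
    "Modal-Secret": "<token-secret>",  # placeholder proxy auth token secret
}

# Hypothetical protected endpoint URL.
response = requests.get("https://ecorp--protected-endpoint.modal.run", headers=headers, timeout=10.0)
print(response.status_code)
```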
We also support standard techniques for securing web servers.

### Token-based authentication

This is easy to implement in whichever framework you're using. For example, if you're using `@modal.fastapi_endpoint` or `@modal.asgi_app` with FastAPI, you can validate a Bearer token like this:

```python
from fastapi import Depends, HTTPException, status, Request
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import modal

image = modal.Image.debian_slim().pip_install("fastapi[standard]")
app = modal.App("auth-example", image=image)

auth_scheme = HTTPBearer()


@app.function(secrets=[modal.Secret.from_name("my-web-auth-token")])
@modal.fastapi_endpoint()
async def f(request: Request, token: HTTPAuthorizationCredentials = Depends(auth_scheme)):
    import os

    print(os.environ["AUTH_TOKEN"])
    if token.credentials != os.environ["AUTH_TOKEN"]:
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Incorrect bearer token",
            headers={"WWW-Authenticate": "Bearer"},
        )

    # Function body
    return "success!"
```

This assumes you have a [Modal Secret](https://modal.com/secrets) named `my-web-auth-token` created, with contents `{AUTH_TOKEN: secret-random-token}`. Now, your endpoint will return a 401 status code except when you hit it with the correct `Authorization` header set (note that you have to prefix the token with `Bearer `):

```bash
curl --header "Authorization: Bearer secret-random-token" https://modal-labs--auth-example-f.modal.run
```

### Client IP address

You can access the IP address of the client making the request. This can be used for geolocation, whitelists, blacklists, and rate limits.

```python
from fastapi import Request

import modal

image = modal.Image.debian_slim().pip_install("fastapi[standard]")
app = modal.App(image=image)


@app.function()
@modal.fastapi_endpoint()
def get_ip_address(request: Request):
    return f"Your IP address is {request.client.host}"
```

#### Streaming endpoints

# Streaming endpoints

Modal web endpoints support streaming responses using FastAPI's [`StreamingResponse`](https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse) class. This class accepts asynchronous generators, synchronous generators, or any Python object that implements the [_iterator protocol_](https://docs.python.org/3/library/stdtypes.html#typeiter), and can be used with Modal Functions!

## Simple example

This simple example combines Modal's `@modal.fastapi_endpoint` decorator with a `StreamingResponse` object to produce a real-time server-sent events (SSE) response.

```python
import time


def fake_event_streamer():
    for i in range(10):
        yield f"data: some data {i}\n\n".encode()
        time.sleep(0.5)


@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
def stream_me():
    from fastapi.responses import StreamingResponse

    return StreamingResponse(
        fake_event_streamer(), media_type="text/event-stream"
    )
```

If you serve this web endpoint and hit it with `curl`, you will see the ten SSE events progressively appear in your terminal over a ~5 second period.

```shell
curl --no-buffer https://modal-labs--example-streaming-stream-me.modal.run
```

The MIME type of `text/event-stream` is important in this example, as it tells the downstream web server to return responses immediately, rather than buffering them in byte chunks (which is more efficient for compression). You can still return other content types like large files in streams, but they are not guaranteed to arrive as real-time events.

## Streaming responses with `.remote`

A Modal Function wrapping a generator function body can have its response passed directly into a `StreamingResponse`. This is particularly useful if you want to do some GPU processing in one Modal Function that is called by a CPU-based web endpoint Modal Function.

```python
@app.function(gpu="any")
def fake_video_render():
    for i in range(10):
        yield f"data: finished processing some data from GPU {i}\n\n".encode()
        time.sleep(1)


@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
def hook():
    from fastapi.responses import StreamingResponse

    return StreamingResponse(
        fake_video_render.remote_gen(), media_type="text/event-stream"
    )
```

## Streaming responses with `.map` and `.starmap`

You can also combine Modal Function parallelization with streaming responses, enabling applications to service a request by farming out to dozens of containers and iteratively returning result chunks to the client.

```python
@app.function()
def map_me(i):
    return f"segment {i}\n"


@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
def mapped():
    from fastapi.responses import StreamingResponse

    return StreamingResponse(map_me.map(range(10)), media_type="text/plain")
```

This snippet will spread the ten `map_me(i)` executions across containers, and return each string response part as it completes. By default the results will be ordered, but if this isn't necessary you can pass `order_outputs=False` as a keyword argument to the `.map` call.

### Asynchronous streaming

The example above uses a synchronous generator, which automatically runs on its own thread, but in asynchronous applications, a loop over a `.map` or `.starmap` call can block the event loop. This will stop the `StreamingResponse` from returning response parts iteratively to the client.

To avoid this, you can use the `.aio()` method to convert a synchronous `.map` into its async version. Also, other blocking calls should be offloaded to a separate thread with `asyncio.to_thread()`.
For example:

```python
import asyncio


@app.function(gpu="any", image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint()
async def transcribe_video(request):
    from fastapi.responses import StreamingResponse

    # `split_video` is assumed to be defined elsewhere in the app.
    segments = await asyncio.to_thread(split_video, request)
    return StreamingResponse(wrapper(segments), media_type="text/event-stream")


# Notice that this is an async generator.
async def wrapper(segments):
    async for partial_result in transcribe_video.map.aio(segments):
        yield "data: " + partial_result + "\n\n"
```

## Further examples

- Complete code for the simple examples given above is available [in our modal-examples GitHub repository](https://github.com/modal-labs/modal-examples/blob/main/07_web_endpoints/streaming.py).
- [An end-to-end example of streaming YouTube video transcriptions with OpenAI's Whisper model.](https://github.com/modal-labs/modal-examples/blob/main/06_gpu_and_ml/openai_whisper/streaming/main.py)

#### Web endpoint URLs

# Web endpoint URLs

This guide documents the behavior of URLs for [web endpoints](https://modal.com/docs/guide/webhooks) on Modal: automatic generation, configuration, programmatic retrieval, and more.

## Determine the URL of a web endpoint from code

Modal Functions with the [`fastapi_endpoint`](https://modal.com/docs/reference/modal.fastapi_endpoint), [`asgi_app`](https://modal.com/docs/reference/modal.asgi_app), [`wsgi_app`](https://modal.com/docs/reference/modal.wsgi_app), or [`web_server`](https://modal.com/docs/reference/modal.web_server) decorator are made available over the Internet when they are [`serve`d](https://modal.com/docs/reference/cli/serve) or [`deploy`ed](https://modal.com/docs/reference/cli/deploy), and so they have a URL. This URL is displayed in the `modal` CLI output and is available in the Modal [dashboard](https://modal.com/apps) for the Function.

To determine a Function's URL programmatically, call its [`get_web_url()`](https://modal.com/docs/reference/modal.Function#get_web_url) method:

```python
@app.function(image=modal.Image.debian_slim().pip_install("fastapi[standard]"))
@modal.fastapi_endpoint(docs=True)
def show_url() -> str:
    return show_url.get_web_url()
```

For deployed Functions, this also works from other Python code! You just need to do a [`from_name`](https://modal.com/docs/reference/modal.Function#from_name) based on the name of the Function and its [App](https://modal.com/docs/guide/apps):

```python notest
import requests

remote_function = modal.Function.from_name("app", "show_url")
remote_function.get_web_url() == requests.get(remote_function.get_web_url()).json()
```

## Auto-generated URLs

By default, Modal Functions will be served from the `modal.run` domain. The full URL will be constructed from a number of pieces of information to uniquely identify the endpoint. At a high-level, web endpoint URLs for deployed applications have the following structure: `https://--