Endpoints
Deploy a production-ready LLM inference endpoint on Modal’s managed infrastructure with a single command:
modal endpoint create --model Qwen/Qwen3.5-4BEndpoints support:
- Open and custom weights — deploy a base model from Modal’s catalog directly, or serve your own fine-tune of it from a Hugging Face repo or a Modal Volume.
- Scale-to-zero autoscaling — endpoints scale up under load and down to zero when idle.
- Usage-based pricing — you pay only for the compute your endpoint uses while it’s running.
- Fast inference by default — every endpoint runs behind a low-latency request proxy on tuned open-source inference engines, with SOTA speculative decoding wherever the recipe supports it.
This page is a high-level guide to Modal Endpoints.
Getting started
Modal supports deploying pre-trained open and custom weight models from the following families:
- Qwen
- Kimi
- Gemma4
- DeepSeek
- Nemotron
- GPT-OSS
- GLM
Browse the full catalog on the Endpoints tab in the dashboard.
Spin up an endpoint for Qwen/Qwen3.5-4B:
modal endpoint create --model Qwen/Qwen3.5-4BModal resolves the model, selects a compatible recipe, and starts provisioning. The command prints the endpoint ID and a dashboard link where you can watch it come online. You can also create endpoints from the Endpoints tab in the dashboard — the form collects the same options.
If you omit the name argument, Modal derives one from the model
(Qwen/Qwen3.5-4B → qwen3-5-4b).
Proxy tokens
Endpoints are authenticated by default. To create and call an authenticated
endpoint you need a proxy token in the same environment as the endpoint. The token is passed on every request
as the Modal-Key and Modal-Secret headers.
Create a proxy token with the CLI:
modal workspace proxy-tokens createThis prints the token ID and secret. The secret is only shown at creation time and can’t be retrieved later, so store it somewhere safe:
Modal-Key: wk-...
Modal-Secret: ws-...With a token available in the environment, create the endpoint:
modal endpoint create --model Qwen/Qwen3.5-4BTo create an endpoint that accepts unauthenticated requests instead, pass --unauthenticated.
Calling your endpoint
Once the endpoint is live, it serves the OpenAI Chat Completions API at the
endpoint URL — find it in the dashboard or with modal endpoint list. The API
is served under /v1, and the model name to pass is the base model repo ID (for
catalog and Volume models) or your custom Hugging Face repo ID.
List the models the endpoint serves with a GET request, passing your proxy token in the Modal-Key and Modal-Secret headers:
curl "<your-endpoint-url>/v1/models" \
-H "Modal-Key: $MODAL_PROXY_AUTH_TOKEN_ID" \
-H "Modal-Secret: $MODAL_PROXY_AUTH_TOKEN_SECRET"Serving custom weights
Point an endpoint at a fine-tuned checkpoint instead of a catalog model. A
custom model is always served against a base model from the catalog: pass that
base model with --model so Modal can pick a compatible recipe, then point at
your weights with the --custom-hf-* or --custom-volume-* flags.
From a Hugging Face repo (use --custom-hf-token for gated or private repos):
modal endpoint create my-ft \
--model Qwen/Qwen3.6-27B \
--custom-hf-repo aisingapore/Qwen-SEA-LION-v4.5-27B-IT \
--custom-hf-revision da42f2c0984d716fb2032e4176d81adfac98c630From a Modal Volume (the model directory must contain config.json):
modal endpoint create my-volume-ft \
--model Qwen/Qwen3.5-4B \
--custom-volume-name my-volume \
--custom-volume-path /checkpoints/1234Choosing where it runs
Two placement controls:
- Routing region (
--routing-region) — where the request proxy is anchored. Pick the region closest to your callers:us-west(default),us-east,eu-west, orap-south. - Compute placement (
--colocate-compute) — by default Modal places containers by availability. Pass--colocate-computeto pin them to the routing region instead.
modal endpoint create \
--model Qwen/Qwen3.5-4B \
--routing-region us-east \
--colocate-computePinning compute to the routing region with --colocate-compute keeps containers
close to the proxy but applies a region selection multiplier.
Managing endpoints
List endpoints in an environment, with their status and URL:
modal endpoint list --env prod
modal endpoint list --env prod --json # machine-readableStop an endpoint when you no longer need it. This tears down its serving containers and stops billing; the endpoint stays listed for reference.
modal endpoint stop qwen3-5-4b --env prodPricing
Endpoints bill for the GPU and CPU their containers use while running, at standard Modal compute rates. Because endpoints scale to zero by default, you pay nothing for compute while idle. You can adjust the autoscaling configuration overrides in the UI. Region pinning applies a region selection multiplier.