Endpoints

Deploy a production-ready LLM inference endpoint on Modal’s managed infrastructure with a single command:

modal endpoint create --model Qwen/Qwen3.5-4B

Endpoints support:

  • Open and custom weights — deploy a base model from Modal’s catalog directly, or serve your own fine-tune of it from a Hugging Face repo or a Modal Volume.
  • Scale-to-zero autoscaling — endpoints scale up under load and down to zero when idle.
  • Usage-based pricing — you pay only for the compute your endpoint uses while it’s running.
  • Fast inference by default — every endpoint runs behind a low-latency request proxy on tuned open-source inference engines, with SOTA speculative decoding wherever the recipe supports it.

This page is a high-level guide to Modal Endpoints.

Getting started 

Modal supports deploying pre-trained open and custom weight models from the following families:

  • Qwen
  • Kimi
  • Gemma4
  • DeepSeek
  • Nemotron
  • GPT-OSS
  • GLM

Browse the full catalog on the Endpoints tab in the dashboard.

Spin up an endpoint for Qwen/Qwen3.5-4B:

modal endpoint create --model Qwen/Qwen3.5-4B

Modal resolves the model, selects a compatible recipe, and starts provisioning. The command prints the endpoint ID and a dashboard link where you can watch it come online. You can also create endpoints from the Endpoints tab in the dashboard — the form collects the same options.

If you omit the name argument, Modal derives one from the model (Qwen/Qwen3.5-4Bqwen3-5-4b).

Proxy tokens 

Endpoints are authenticated by default. To create and call an authenticated endpoint you need a proxy token in the same environment as the endpoint. The token is passed on every request as the Modal-Key and Modal-Secret headers.

Create a proxy token with the CLI:

modal workspace proxy-tokens create

This prints the token ID and secret. The secret is only shown at creation time and can’t be retrieved later, so store it somewhere safe:

Modal-Key: wk-...
Modal-Secret: ws-...

With a token available in the environment, create the endpoint:

modal endpoint create --model Qwen/Qwen3.5-4B

To create an endpoint that accepts unauthenticated requests instead, pass --unauthenticated.

Calling your endpoint 

Once the endpoint is live, it serves the OpenAI Chat Completions API at the endpoint URL — find it in the dashboard or with modal endpoint list. The API is served under /v1, and the model name to pass is the base model repo ID (for catalog and Volume models) or your custom Hugging Face repo ID.

List the models the endpoint serves with a GET request, passing your proxy token in the Modal-Key and Modal-Secret headers:

curl "<your-endpoint-url>/v1/models" \
  -H "Modal-Key: $MODAL_PROXY_AUTH_TOKEN_ID" \
  -H "Modal-Secret: $MODAL_PROXY_AUTH_TOKEN_SECRET"

Serving custom weights 

Point an endpoint at a fine-tuned checkpoint instead of a catalog model. A custom model is always served against a base model from the catalog: pass that base model with --model so Modal can pick a compatible recipe, then point at your weights with the --custom-hf-* or --custom-volume-* flags.

From a Hugging Face repo (use --custom-hf-token for gated or private repos):

modal endpoint create my-ft \
  --model Qwen/Qwen3.6-27B \
  --custom-hf-repo aisingapore/Qwen-SEA-LION-v4.5-27B-IT \
  --custom-hf-revision da42f2c0984d716fb2032e4176d81adfac98c630

From a Modal Volume (the model directory must contain config.json):

modal endpoint create my-volume-ft \
  --model Qwen/Qwen3.5-4B \
  --custom-volume-name my-volume \
  --custom-volume-path /checkpoints/1234

Choosing where it runs 

Two placement controls:

  • Routing region (--routing-region) — where the request proxy is anchored. Pick the region closest to your callers: us-west (default), us-east, eu-west, or ap-south.
  • Compute placement (--colocate-compute) — by default Modal places containers by availability. Pass --colocate-compute to pin them to the routing region instead.
modal endpoint create \
  --model Qwen/Qwen3.5-4B \
  --routing-region us-east \
  --colocate-compute

Pinning compute to the routing region with --colocate-compute keeps containers close to the proxy but applies a region selection multiplier.

Managing endpoints 

List endpoints in an environment, with their status and URL:

modal endpoint list --env prod
modal endpoint list --env prod --json   # machine-readable

Stop an endpoint when you no longer need it. This tears down its serving containers and stops billing; the endpoint stays listed for reference.

modal endpoint stop qwen3-5-4b --env prod

Pricing 

Endpoints bill for the GPU and CPU their containers use while running, at standard Modal compute rates. Because endpoints scale to zero by default, you pay nothing for compute while idle. You can adjust the autoscaling configuration overrides in the UI. Region pinning applies a region selection multiplier.