Run Falcon-40B with AutoGPTQ

In this example, we run a quantized 4-bit version of Falcon-40B, the first open-source large language model of its size, using Hugging Face’s transformers library and AutoGPTQ.

Due to the current limitations of the library, the inference speed is a little under 1 token/second and the cold start time on Modal is around 25s.

For faster inference at the expense of a slower cold start, check out Running Falcon-40B with bitsandbytes quantization. You can also run a smaller, 7-billion-parameter model with the OpenLLaMa example.


First we import the components we need from modal.

from modal import Image, Stub, gpu, method, web_endpoint

Define a container image

To take advantage of Modal’s fast cold-start times, we download the model weights into a folder inside our container image. These weights come from a quantized model hosted on Hugging Face.

IMAGE_MODEL_DIR = "/model"

def download_model():
    from huggingface_hub import snapshot_download

    model_name = "TheBloke/falcon-40b-instruct-GPTQ"
    snapshot_download(model_name, local_dir=IMAGE_MODEL_DIR)

Now, we define our image. We’ll use the debian-slim base image, and install the dependencies we need using pip_install. At the end, we’ll use run_function to run the function defined above as part of the image build.

image = (
    Image.debian_slim()
    .pip_install(
        "auto-gptq @ git+",
        "transformers @ git+",
    )
    # Use huggingface's hi-perf hf-transfer library to download this large model.
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
    .run_function(download_model)
)

Let’s instantiate and name our Stub.

stub = Stub(name="example-falcon-gptq", image=image)

The model class

Next, we write the model code. We want Modal to load the model into memory just once every time a container starts up, so we use class syntax and the __enter__ method.

Within the @stub.cls decorator, we use the gpu parameter to specify that we want to run our function on an A100 GPU. We also allow each call 10 minutes to complete, and ask the container to stay alive for 5 minutes after its last request.
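
As a rough sanity check on these settings, the sketch below estimates how long a single call might take at the throughput quoted earlier (a little under 1 token/second, plus a cold start of around 25s). The 512-token response length is an assumption for illustration, not a value taken from this example. The sketch counts the cold start against the timeout only to be conservative, since this example does not say whether it does.

```python
# Back-of-the-envelope latency estimate; all inputs are approximate.
COLD_START_S = 25          # cold start quoted in this example
TOKENS_PER_SECOND = 1.0    # upper bound on the quoted throughput
TIMEOUT_S = 60 * 10        # per-call timeout passed to @stub.cls


def estimated_call_seconds(new_tokens: int, cold: bool = True) -> float:
    """Estimate wall-clock time for generating `new_tokens` tokens."""
    generation = new_tokens / TOKENS_PER_SECOND
    return generation + (COLD_START_S if cold else 0)


# 512 tokens is an assumed response length, chosen only for illustration.
estimate = estimated_call_seconds(512)
print(f"~{estimate:.0f}s of a {TIMEOUT_S}s budget")
```

Under these assumptions a long answer uses most of the 10-minute budget, so a larger response length would likely need a longer timeout.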

The rest is just using the transformers library to run the model. Refer to the documentation for more parameters and tuning.

Note that we need to create a separate thread to call the generate function because we need to yield the text back from the streamer in the main thread. This is an idiosyncrasy of streaming in transformers.
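
To make the threading pattern concrete, here is a minimal sketch that uses only the standard library. QueueStreamer and fake_generate are hypothetical stand-ins for the transformers streamer and the model's generate call. They are not part of any library and only mimic the blocking behavior described above.

```python
from queue import Queue
from threading import Thread

_STOP = object()  # sentinel marking the end of the stream


class QueueStreamer:
    """Hypothetical stand-in for a text streamer: a blocking iterator over a queue."""

    def __init__(self):
        self._queue = Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(_STOP)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer pushes something
        if item is _STOP:
            raise StopIteration
        return item


def fake_generate(words, streamer):
    """Stand-in for model.generate: returns only after every piece is pushed."""
    for word in words:
        streamer.put(word + " ")
    streamer.end()


def stream(words):
    streamer = QueueStreamer()
    # The producer runs in a background thread so this generator can yield
    # pieces as they arrive instead of waiting for generation to finish.
    thread = Thread(target=fake_generate, args=(words, streamer))
    thread.start()
    for piece in streamer:
        yield piece
    thread.join()


print("".join(stream(["streamed", "one", "piece", "at", "a", "time"])))
```

If fake_generate were called directly inside stream, nothing could be yielded until it had returned, which is why the producer gets its own thread.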

@stub.cls(gpu=gpu.A100(), timeout=60 * 10, container_idle_timeout=60 * 5)
class Falcon40BGPTQ:
    def __enter__(self):
        from auto_gptq import AutoGPTQForCausalLM
        from transformers import AutoTokenizer

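        self.tokenizer = AutoTokenizer.from_pretrained(
            IMAGE_MODEL_DIR, use_fast=True
        )
        print("Loaded tokenizer.")

        # Loading arguments are assumed; see the AutoGPTQ documentation.
        self.model = AutoGPTQForCausalLM.from_quantized(
            IMAGE_MODEL_DIR,
            trust_remote_code=True, use_safetensors=True, device_map="auto",
        )
        print("Loaded model.")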

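    @method()
    def generate(self, prompt: str):
        from threading import Thread

        from transformers import TextIteratorStreamer

        inputs = self.tokenizer(prompt, return_tensors="pt")
        streamer = TextIteratorStreamer(
            self.tokenizer, skip_special_tokens=True
        )
        # Generation settings are assumed; tune max_new_tokens as needed.
        generation_kwargs = dict(
            inputs=inputs.input_ids.cuda(),
            attention_mask=inputs.attention_mask.cuda(),
            max_new_tokens=512,
            streamer=streamer,
        )

        # Run generation on separate thread to enable response streaming.
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()
        for new_text in streamer:
            yield new_text

        thread.join()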


Run the model

We define a local_entrypoint to call our remote function sequentially for a list of inputs. You can run this locally with modal run -q followed by the path to this script. The -q flag enables streaming to work in the terminal output.

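prompt_template = (
    "A chat between a curious human user and an artificial intelligence assistant. The assistant gives a helpful, detailed, and accurate answer to the user's question."
    # Assumed turn layout; format() fills the single {} slot with the question.
    "\n\nUser:\n{}\n\nAssistant:\n"
)


@stub.local_entrypoint()
def cli():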
    question = "What are the main differences between Python and JavaScript programming languages?"
    model = Falcon40BGPTQ()
    for text in model.generate.remote_gen(prompt_template.format(question)):
        print(text, end="", flush=True)
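
The entrypoint fills the template with a positional str.format call, so the template needs exactly one {} placeholder. The sketch below shows that substitution in isolation. The "User:"/"Assistant:" layout and the build_prompt helper are assumptions for illustration, not necessarily the format this model was tuned on.

```python
# Hypothetical template: the "User:"/"Assistant:" layout is assumed.
SYSTEM_PROMPT = (
    "A chat between a curious human user and an artificial intelligence assistant."
)
TEMPLATE = SYSTEM_PROMPT + "\n\nUser:\n{}\n\nAssistant:\n"


def build_prompt(question: str) -> str:
    """Substitute the question into the single positional placeholder."""
    return TEMPLATE.format(question)


print(build_prompt("What is GPTQ?"))
```

Braces inside the question are passed through untouched. Only literal braces in the template itself would need to be doubled.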

Serve the model

Finally, we can serve the model from a web endpoint with modal deploy followed by the path to this script. If you visit the resulting URL with a question parameter in the query string, you can watch the model stream back a response. You can try our deployment here.
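
As a small illustration of passing the question parameter, the sketch below builds a request URL with the standard library. BASE_URL is a placeholder and question_url is a hypothetical helper. Substitute the address of your own deployment.

```python
from urllib.parse import urlencode

# Placeholder address: substitute the URL of your own deployment.
BASE_URL = "https://example.invalid/get"


def question_url(question: str, base_url: str = BASE_URL) -> str:
    """Append a URL-encoded `question` query parameter to the endpoint URL."""
    return f"{base_url}?{urlencode({'question': question})}"


print(question_url("What is GPTQ?"))
```

Encoding the parameter keeps spaces and punctuation in the question from breaking the URL.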

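@stub.function(timeout=60 * 10)
@web_endpoint()
def get(question: str):
    from itertools import chain

    from fastapi.responses import StreamingResponse

    model = Falcon40BGPTQ()
    return StreamingResponse(
        chain(
            ["Loading model. This usually takes around 20s ...\n\n"],
            model.generate.remote_gen(prompt_template.format(question)),
        ),
        media_type="text/event-stream",
    )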
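
The endpoint imports itertools.chain to put a status message in front of the generated text. The sketch below shows that pattern on its own. fake_tokens and response_body are hypothetical names, with fake_tokens standing in for the remote generator.

```python
from itertools import chain


def fake_tokens():
    """Stand-in for the remote generator that yields text as it is produced."""
    yield from ["Python ", "is ", "dynamically ", "typed."]


def response_body(status_message, tokens):
    # Wrap the message in a list so chain() yields it as one chunk; passing a
    # bare string would make chain() iterate over it character by character.
    return chain([status_message], tokens)


for chunk in response_body("Loading model ...\n\n", fake_tokens()):
    print(chunk, end="", flush=True)
```

Because chain is lazy, the status message can be sent right away while the remaining chunks are pulled from the generator as they become available.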