QuiLLMan: Voice Chat with LLMs

Vicuna is the latest in a series of open-source chatbots that approach the quality of proprietary models like GPT-4, but can also be self-hosted at a fraction of the cost. We’ve enjoyed playing around with Vicuna at Modal HQ enough that we decided we wanted to have it available at all times, in the form of a voice chat app.

So, we built QuiLLMan, a complete chat app that transcribes audio in real-time using Whisper, streams back a response from a language model, and synthesizes this response as natural-sounding speech.


Everything (including the React frontend and backend API) is deployed serverlessly on Modal, and you can play around with the live demo here. This post provides a high-level walkthrough of the repo. We’re looking to add more models and features to this as time goes on, and contributions are welcome!

Code overview

Traditionally, building a serverless web application with a backend API and three different ML services, each running in its own custom container and autoscaling independently, would require a lot of work. But with Modal, it’s as simple as writing four classes and running a CLI command.

The project consists of the following components:

  1. Language model service
  2. Transcription service
  3. Text-to-speech service
  4. FastAPI server
  5. React frontend

Let’s go through each of these components in more detail.

Language model

We use GPTQ-for-LLaMa, an implementation of GPTQ, to quantize our model to 4 bits for faster inference. This repository needs to be built from source against the CUDA runtime. Fortunately, Modal makes it easy to express a complex image definition like this one as a chain of method calls:

vicuna_image = (
    modal.Image.from_registry("nvidia/cuda:12.2.0-devel-ubuntu20.04", add_python="3.8")
    .apt_install("git", "gcc", "build-essential")
    .run_commands(
        "git clone https://github.com/thisserand/FastChat.git",
        "cd FastChat && pip install -e .",
    )
    .run_commands(
        # FastChat hard-codes a path for GPTQ, so it needs to be cloned inside FastChat's repositories/ directory.
        "git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda /FastChat/repositories/GPTQ-for-LLaMa",
        "cd /FastChat/repositories/GPTQ-for-LLaMa && python setup_cuda.py install",
        gpu="any",
    )
)

Above, we use from_registry to select the official CUDA container as the base image, install Python and the build requirements, clone the FastChat repo, and finally build GPTQ-for-LLaMa. Note that the compilation step requires a CUDA environment; adding gpu="any" lets us run that step on a GPU machine.

The generate function itself constructs a prompt from the current input, the previous history, and a prompt template, then simply yields tokens as they are produced. Python generators just work out-of-the-box in Modal, so building streaming interactions is easy.
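
The details live in the src.llm_vicuna module, but the core logic is roughly the sketch below. The prompt template, the build_prompt helper, and the model.stream call are illustrative placeholders rather than the repo’s actual names:

PROMPT_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed answers to the user's questions.\n"
    "{history}\nUSER: {input}\nASSISTANT:"
)

def build_prompt(input: str, history: list[tuple[str, str]]) -> str:
    # Fold prior (user, assistant) turns and the new input into one prompt string.
    past = "\n".join(f"USER: {q}\nASSISTANT: {a}" for q, a in history)
    return PROMPT_TEMPLATE.format(history=past, input=input)

def generate(model, input: str, history: list[tuple[str, str]]):
    prompt = build_prompt(input, history)
    # Yield tokens as the model emits them; `model.stream` stands in for
    # FastChat's actual token-streaming call.
    for token in model.stream(prompt):
        yield token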

Although we’re going to call this model from our backend API, it’s useful to test it directly as well. To do this, we define a local_entrypoint:

@stub.local_entrypoint()
def main(input: str):
    model = Vicuna()
    for val in model.generate.remote(input):
        print(val, end="", flush=True)

Now, we can run the model with a prompt of our choice from the terminal:

modal run -q src.llm_vicuna --input "How do antihistamines work?"

Transcription

In this file, we define a Modal class that uses OpenAI’s Whisper to transcribe audio in real time. The helper function load_audio uses ffmpeg to downsample the audio to the 16 kHz mono format that Whisper expects.
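
For reference, a helper along those lines might look like the following sketch, modeled on Whisper’s own load_audio; it assumes the ffmpeg-python package plus an ffmpeg binary in the container:

import ffmpeg
import numpy as np

def load_audio(data: bytes, sample_rate: int = 16000) -> np.ndarray:
    # Decode whatever the browser sent and resample it to 16 kHz mono PCM,
    # the format Whisper expects.
    out, _ = (
        ffmpeg.input("pipe:", threads=0)
        .output("pipe:", format="s16le", acodec="pcm_s16le", ac=1, ar=sample_rate)
        .run(capture_stdout=True, capture_stderr=True, input=data)
    )
    # Convert 16-bit PCM samples to float32 in [-1, 1].
    return np.frombuffer(out, np.int16).astype(np.float32) / 32768.0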

We’re using an A10G GPU for transcriptions, which lets us transcribe most segments in under 2 seconds.
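
Pinning that GPU is just an argument on the class decorator. A rough sketch of the shape of the transcription class follows; the class name, model size, and whisper_image are illustrative stand-ins rather than the repo’s exact code:

import modal

@stub.cls(gpu="A10G", image=whisper_image)
class Whisper:
    def __enter__(self):
        import whisper
        # Load the weights once per container so each request only pays for inference.
        self.model = whisper.load_model("base.en", device="cuda")

    @modal.method()
    def transcribe_segment(self, audio_data: bytes) -> str:
        audio = load_audio(audio_data)  # 16 kHz mono float32, as above
        result = self.model.transcribe(audio, language="en")
        return result["text"]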

Text-to-speech

The text-to-speech service is adapted from tortoise-tts-modal-api, a Modal deployment of Tortoise TTS. Take a look at those repos if you’re interested in understanding how the code works, or for a full list of parameters and voices you can use.

FastAPI server

Our backend is a FastAPI app. Modal provides an @asgi_app decorator that lets us serve this app on the internet without any extra effort.

Of the four endpoints in the file, POST /generate is the most interesting. It calls Vicuna.generate and streams the text results back. When a sentence is completed, it also calls Tortoise.speak asynchronously to generate audio and returns a handle to the function call. This handle can be used to poll for the audio later (take a look at our job queue example for an explanation of this pattern). If Tortoise is not enabled, we return the sentence directly so that the frontend can use the browser’s built-in text-to-speech.
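
Concretely, the generator feeding that stream interleaves typed messages, roughly as in the sketch below. The sentence detection is simplified and the message shapes are illustrative; Vicuna and Tortoise are the project’s Modal classes:

def gen(input: str):
    sentence = ""
    for token in Vicuna().generate.remote(input):
        # Stream each text token to the client as soon as it arrives.
        yield {"type": "text", "value": token}
        sentence += token
        if sentence.rstrip().endswith((".", "!", "?")):
            # Kick off speech synthesis asynchronously and hand the client a
            # call ID it can poll via GET /audio/{call_id} once it's ready.
            call = Tortoise().speak.spawn(sentence)
            yield {"type": "audio", "value": call.object_id}
            sentence = ""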

In order to send these different types of messages over the same stream, each is sent as a serialized JSON consisting of a type and payload. The ASCII record separator character (\x1e) is used to delimit the messages, since it cannot appear in JSON.

    def gen_serialized():
        for i in gen():
            yield json.dumps(i) + "\x1e"

    return StreamingResponse(
        gen_serialized(),
        media_type="text/event-stream",
    )

In addition, the function checks whether the request body contains a noop flag. This is used to warm the containers when the user first loads the page, so that the models can be loaded into memory ahead of time.

The other endpoints are more straightforward:

  • POST /transcribe: Calls Whisper.transcribe and returns the results directly.
  • GET /audio/{call_id}: Polls to check whether the Tortoise.speak call with the given ID has completed. If it has, it returns the audio data; if not, it returns a 202 status code to indicate that the request should be retried later (see the sketch after this list).
  • DELETE /audio/{call_id}: Cancels the Tortoise.speak call with the given ID. Useful if we want to stop generating audio for a given user.
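
Both audio endpoints lean on Modal’s function call handles, along the lines of the sketch below (web_app is the FastAPI instance; the response types and media type here are assumptions):

import modal
from fastapi import Response

@web_app.get("/audio/{call_id}")
async def get_audio(call_id: str):
    function_call = modal.functions.FunctionCall.from_id(call_id)
    try:
        # timeout=0 returns immediately instead of blocking on the result.
        wav = function_call.get(timeout=0)
    except TimeoutError:
        # Audio isn't ready yet: ask the frontend to poll again shortly.
        return Response(status_code=202)
    return Response(content=wav, media_type="audio/wav")

@web_app.delete("/audio/{call_id}")
async def cancel_audio(call_id: str):
    # Stop a synthesis that's no longer needed.
    modal.functions.FunctionCall.from_id(call_id).cancel()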

React frontend

We use the Web Audio API to record snippets of audio from the user’s microphone. The file src/frontend/processor.js defines an AudioWorkletProcessor that distinguishes between speech and silence, and emits events for speech segments so we can transcribe them.

Pending text-to-speech syntheses are stored in a queue. For the next item in the queue, we use the GET /audio/{call_id} endpoint to poll for the audio data.

Finally, the frontend maintains a state machine to manage the state of the conversation and transcription progress. This is implemented with the help of the incredible XState library.

const chatMachine = createMachine(
  {
    initial: "botDone",
    states: {
      botGenerating: {
        on: {
          GENERATION_DONE: { target: "botDone", actions: "resetTranscript" },
        },
      },
      botDone: { ... },
      userTalking: { ... },
      userSilent: { ... },
    },
    ...
  },
);

Steal this example

The code for this entire example is available on GitHub. Follow the instructions in the README for how to run or deploy it yourself on Modal.