QuiLLMan: Voice Chat with LLMs
Vicuna is the latest in a series of open-source chatbots that approach the quality of proprietary models like GPT-4 and, unlike them, can be self-hosted at a fraction of the cost. We’ve enjoyed playing around with Vicuna at Modal HQ enough that we decided we wanted it available at all times, in the form of a voice chat app.
So, we built QuiLLMan, a complete chat app that transcribes audio in real-time using Whisper, streams back a response from a language model, and synthesizes this response as natural-sounding speech.
Everything (including the React frontend and backend API) is deployed serverlessly on Modal, and you can play around with the live demo here. This post provides a high-level walkthrough of the repo. We’re looking to add more models and features to this as time goes on, and contributions are welcome!
Code overview
Traditionally, building a serverless web application with a backend API and three different ML services, each of these running in its own custom container and autoscaling independently, would require a lot of work. But with Modal, it’s as simple as writing 4 different classes and running a CLI command.
Our project is organized as a small set of Modal services (the language model, the transcriber, and text-to-speech), plus a FastAPI backend and a React frontend. Let’s go through each of these components in more detail.
For the language model, we use GPTQ-for-LLaMa, an implementation of GPTQ, to quantize our model to 4 bits for faster inference. This repository needs to be built from source against the CUDA runtime. Fortunately, Modal makes it easy to express a complex image definition like this one as a series of functions:
vicuna_image = (
    modal.Image.from_dockerhub(
        "nvidia/cuda:11.7.0-devel-ubuntu20.04",
        setup_dockerfile_commands=[
            "RUN apt-get update",
            "RUN apt-get install -y python3 python3-pip python-is-python3",
        ],
    )
    .apt_install("git", "gcc", "build-essential")
    .run_commands(
        "git clone https://github.com/thisserand/FastChat.git",
        "cd FastChat && pip install -e .",
    )
    .run_commands(
        # FastChat hard-codes a path for GPTQ, so this needs to be cloned inside repositories.
        "git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda /FastChat/repositories/GPTQ-for-LLaMa",
        "cd /FastChat/repositories/GPTQ-for-LLaMa && python setup_cuda.py install",
        gpu="any",
    )
)
Above, we use from_dockerhub to select the official CUDA container as the base image, install Python and the build requirements, clone the FastChat repo, and finally build GPTQ-for-LLaMa. Note that in order for the compilation step to work, a CUDA environment is required; adding gpu="any" lets us run that step on a GPU machine.
The generate function itself constructs a prompt using the current input, the previous history, and a prompt template. Then, it simply yields tokens as they are produced. Python generators work out of the box in Modal, so building streaming interactions is easy.
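As a rough sketch of that pattern (not the repo’s exact code: Modal’s class and method decorators are omitted, the prompt template is abbreviated, and stream_tokens stands in for the actual GPTQ inference loop), the method looks something like this:

class Vicuna:
    def generate(self, input: str, history=None):
        # Fold the conversation so far, plus the new input, into a single prompt.
        prompt = "A chat between a curious user and an artificial intelligence assistant.\n"
        for user_msg, assistant_msg in history or []:
            prompt += f"USER: {user_msg}\nASSISTANT: {assistant_msg}\n"
        prompt += f"USER: {input}\nASSISTANT:"

        # Yield tokens as the model produces them; Modal streams these back
        # to whoever called the function.
        for token in self.stream_tokens(prompt):
            yield token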
Although we’re going to call this model from our backend API, it’s useful to test it directly as well. To do this, we define a local_entrypoint:
@stub.local_entrypoint()
def main(input: str):
    model = Vicuna()
    for val in model.generate.call(input):
        print(val, end="", flush=True)
Now, we can run the model with a prompt of our choice from the terminal:
modal run -q src.llm_vicuna --input "How do antihistamines work?"
For transcription, we define a Modal class that uses OpenAI’s Whisper to transcribe audio in real time. The helper function load_audio downsamples the audio to 16 kHz (required by Whisper) using ffmpeg.
We’re using an A10G GPU for transcriptions, which lets us transcribe most segments in under 2 seconds.
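A hedged sketch of the transcriber is below, with load_audio adapted from the helper of the same name in OpenAI’s Whisper repo; the decorator arguments, method name, model size, and transcriber_image are simplified placeholders and may not match the repo exactly:

import ffmpeg
import numpy as np
from modal import method

def load_audio(data: bytes, sr: int = 16000) -> np.ndarray:
    # Decode whatever the browser sent and resample it to 16 kHz mono float32 PCM,
    # the input format Whisper expects.
    out, _ = (
        ffmpeg.input("pipe:", threads=0)
        .output("pipe:", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
        .run(capture_stdout=True, capture_stderr=True, input=data)
    )
    return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0

@stub.cls(gpu="a10g", image=transcriber_image)
class Whisper:
    def __enter__(self):
        # Load the model once per container, not once per request.
        import whisper
        self.model = whisper.load_model("base.en")  # model size is illustrative

    @method()
    def transcribe_segment(self, audio_data: bytes) -> str:
        result = self.model.transcribe(load_audio(audio_data), language="en")
        return result["text"]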
The text-to-speech service is adapted from tortoise-tts-modal-api, a Modal deployment of Tortoise TTS. Take a look at those repos if you’re interested in understanding how the code works, or for a full list of parameters and voices you can use.
Our backend is a FastAPI app. Modal provides an @asgi_app decorator that lets us serve this app on the internet without any extra effort.
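The wiring is only a few lines. A sketch (the decorator spelling follows Modal’s docs from around this time and has since evolved; app_image is a placeholder for the backend’s image):

from fastapi import FastAPI

web_app = FastAPI()

# ... route definitions on web_app go here ...

@stub.function(image=app_image)
@stub.asgi_app()
def web():
    return web_app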
Of the 4 endpoints in the file, POST /generate is the most interesting. It calls Vicuna.generate and streams the text results back. When a sentence is completed, it also calls Tortoise.speak asynchronously to generate audio, and returns a handle to the function call. This handle can be used to poll for the audio later (take a look at our job queue example for an explanation of this pattern). If Tortoise is not enabled, we return the sentence directly so that the frontend can use the browser’s built-in text-to-speech.
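A simplified sketch of the generator behind that endpoint follows. The sentence detection, message field names, and the way input, history, and tts_enabled come from the request body are illustrative rather than the repo’s exact code; spawn is Modal’s fire-and-forget invocation, which returns a handle with an object_id.

def gen():
    sentence = ""
    for segment in Vicuna().generate.call(input, history):
        # Stream each text fragment back to the client as soon as it arrives.
        yield {"type": "text", "payload": segment}
        sentence += segment
        if sentence.strip().endswith((".", "!", "?")):
            if tts_enabled:
                # Kick off audio synthesis without waiting for it; the frontend
                # polls for the result later using the returned call ID.
                call = Tortoise().speak.spawn(sentence)
                yield {"type": "audio", "payload": call.object_id}
            else:
                yield {"type": "sentence", "payload": sentence}
            sentence = ""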
In order to send these different types of messages over the same stream, each is sent as a serialized JSON object consisting of a type and a payload. The ASCII record separator character (\x1e) is used to delimit the messages, since it cannot appear in JSON.
def gen_serialized():
    for i in gen():
        yield json.dumps(i) + "\x1e"

return StreamingResponse(
    gen_serialized(),
    media_type="text/event-stream",
)
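On the receiving end, a client just splits the stream on that separator. For example, a hypothetical Python client (not part of the repo; the URL and request body fields are assumptions for illustration):

import json
import requests

BACKEND_URL = "https://example.modal.run"  # placeholder; use your deployment's URL

resp = requests.post(
    f"{BACKEND_URL}/generate",
    json={"input": "How do antihistamines work?"},
    stream=True,
)
buffer = ""
for chunk in resp.iter_content(chunk_size=None):
    buffer += chunk.decode("utf-8")
    # Messages are delimited by the record separator, so split as they complete.
    while "\x1e" in buffer:
        message, buffer = buffer.split("\x1e", 1)
        msg = json.loads(message)
        print(msg["type"], msg["payload"])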
In addition, the function checks if the body contains a noop flag. This is used to warm the containers when the user first loads the page, so that the models can be loaded into memory ahead of time.
The other endpoints are more straightforward:

- POST /transcribe: Calls Whisper.transcribe and returns the results directly.
- GET /audio/{call_id}: Polls to check if a Tortoise.speak call ID generated above has completed. If yes, it returns the audio data. If not, it returns a 202 status code to indicate that the request should be retried (a sketch of this polling pattern follows the list).
- DELETE /audio/{call_id}: Cancels a Tortoise.speak call ID generated above. Useful if we want to stop generating audio for a given user.
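Both call-ID endpoints lean on Modal’s function call handles. A sketch of the polling endpoint, following the pattern from Modal’s job queue example (the route wiring and audio/wav response type are assumptions):

import modal
from fastapi.responses import Response

@web_app.get("/audio/{call_id}")
async def get_audio(call_id: str):
    function_call = modal.functions.FunctionCall.from_id(call_id)
    try:
        # A zero-second timeout returns immediately if the result isn't ready yet.
        wav_bytes = function_call.get(timeout=0)
    except TimeoutError:
        return Response(status_code=202)  # not done; the client should retry
    return Response(wav_bytes, media_type="audio/wav")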
We use the Web Audio API to record snippets of audio from the user’s microphone. The file src/frontend/processor.js defines an AudioWorkletProcessor that distinguishes between speech and silence, and emits events for speech segments so we can transcribe them.
Pending text-to-speech syntheses are stored in a queue. For the next item in the queue, we use the GET /audio/{call_id} endpoint to poll for the audio data.
Finally, the frontend maintains a state machine to manage the state of the conversation and transcription progress. This is implemented with the help of the incredible XState library.
const chatMachine = createMachine(
  {
    initial: "botDone",
    states: {
      botGenerating: {
        on: {
          GENERATION_DONE: { target: "botDone", actions: "resetTranscript" },
        },
      },
      botDone: { ... },
      userTalking: { ... },
      userSilent: { ... },
    },
    ...
  },
  ...
);
Steal this example
The code for this entire example is available on GitHub. Follow the instructions in the README for how to run or deploy it yourself on Modal.