
If you are looking to build a real-time voice or video application, you can’t just use HTTP. It’s too slow. Traditional HTTP is request-response based, creating overhead for each interaction. Establishing new TCP connections and handshaking also creates additional latency.
Instead, you should be using technologies like WebRTC. WebRTC is purpose-built for peer-to-peer audio/video streaming and data sharing without requiring plugins or additional software.
But WebRTC is complex. It’s not easy to get right. You often have to write thousands of lines of boilerplate to handle signalling, media capture, peer connections, ICE candidates, STUN/TURN servers, and so on.
That’s why LiveKit has become so popular. LiveKit is an open-source platform that abstracts away the complexity of working with WebRTC. Rather than writing all the boilerplate yourself, you just use LiveKit’s SDKs.
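To give a sense of how much the SDK hides, here’s a rough sketch of joining a room with the LiveKit Python SDK (the server URL and access token are placeholders; in practice you generate a token with your API key and secret):

import asyncio
from livekit import rtc

async def main():
    room = rtc.Room()
    # connect() handles the signalling, ICE negotiation, and media transport for you
    await room.connect("wss://your-project.livekit.cloud", "<access-token>")
    print("Connected to room:", room.name)

asyncio.run(main())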
LiveKit Agents
LiveKit recently launched LiveKit Agents, a framework for building real-time voice assistants.
It allows you to define an AI agent that will join as a participant in a LiveKit room.
This guide will walk you through deploying LiveKit agents on Modal using Python. We’ll cover the LiveKit agent setup, the different configuration options within LiveKit, and how to actually deploy on Modal.
LiveKit Agent Lifecycle
Here’s a high-level overview of the agent lifecycle:
- Worker registration: Your agent connects to the LiveKit server via a WebSocket and registers as a "worker" (sketched below).
- Agent dispatch: When a user connects to a room, the LiveKit server selects an available worker, which then instantiates your program and joins the room. A worker can run multiple agent instances in separate processes.
- Your program: Here, you use the LiveKit Python SDK and can leverage plugins for processing voice and video data.
- Room close: The room closes automatically when the last non-agent participant leaves, and any remaining agents are then disconnected.
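To make the worker registration step concrete: when an agent runs as a standalone process, registration typically looks roughly like the sketch below, assuming LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET are set in the environment. Later in this guide we construct the worker directly instead of using the CLI helper.

from livekit.agents import JobContext, WorkerOptions, cli

async def entrypoint(ctx: JobContext):
    # The agent joins the room it was dispatched to
    await ctx.connect()

if __name__ == "__main__":
    # Registers this process as a worker with the LiveKit server over a WebSocket
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))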
Why Deploy LiveKit Agents on Modal?
You can also deploy LiveKit Agents on Render, Kubernetes, and other cloud providers, but we think that Modal is the best option. Modal is a serverless cloud platform and Python library. With Modal, you can write a Python function, add a Modal decorator, and deploy your application in a container in the cloud in seconds.
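As a rough illustration of that workflow (a minimal, hypothetical example unrelated to LiveKit):

import modal

app = modal.App("hello-modal")

@app.function()
def hello(name: str) -> str:
    # This function runs in a container in Modal's cloud
    return f"Hello, {name}!"

@app.local_entrypoint()
def main():
    # .remote() executes the function remotely and returns the result
    print(hello.remote("world"))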
✅ No Infrastructure Management
Modal removes the complexity of managing Kubernetes clusters or provisioning cloud instances. Your LiveKit agents run in a fully managed environment with zero operational overhead.
✅ Automatic Scaling
With Modal, you can scale your LiveKit workloads dynamically based on demand. Modal’s serverless execution model ensures you only pay for what you use.
✅ Optimized GPU Execution
If your agent needs to run deep learning models, Modal supports running your workloads on GPUs like NVIDIA H100s.
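Requesting a GPU is a one-line change to the function decorator. A minimal sketch (the app name and function here are hypothetical):

import modal

app = modal.App("gpu-example")

@app.function(gpu="H100")
def run_model():
    # Deep learning workloads (e.g. a local STT or TTS model) run on an H100 here
    ...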
Prerequisites
To run the following code, you will need:
- A LiveKit account
- A Modal account
- Accounts with the AI API providers you want to use (OpenAI, Cartesia, Deepgram, etc.), along with their API keys
- Run pip install modal to install the Modal Python package
- Run modal setup to authenticate (if this doesn't work, try python -m modal setup)
- Copy the code below into a file called app.py
- Run modal deploy app.py to deploy it (see Step 7)
Setting Up LiveKit Agents on Modal
Step 1: Adding Secrets in Modal Dashboard
Before deploying your LiveKit agent, you need to add your API keys and secrets to the Modal Dashboard to securely store and access them.
Navigate to the Secrets section in the Modal dashboard and create a secret named livekit-voice-agent (the name referenced in the code below) containing the following keys. For example purposes, we're using OpenAI, Cartesia, and Deepgram:
- LIVEKIT_URL - Your LiveKit WebRTC server URL
- LIVEKIT_API_KEY - API key for authenticating LiveKit requests
- LIVEKIT_API_SECRET - API secret for LiveKit authentication

You can find your LiveKit URL and API keys under Settings > Project and Settings > Keys in the LiveKit dashboard.

- OPENAI_API_KEY - API key for OpenAI's GPT-based processing
- CARTESIA_API_KEY - API key for Cartesia's TTS services
- DEEPGRAM_API_KEY - API key for Deepgram's STT services
Once added, you can reference these secrets in your Modal functions.
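For example, a function that attaches the livekit-voice-agent secret (the name we use later in this guide) can read its keys as environment variables. A minimal sketch:

import os
import modal

app = modal.App("secrets-example")

@app.function(secrets=[modal.Secret.from_name("livekit-voice-agent")])
def show_livekit_url():
    # Each key in the secret is injected as an environment variable
    print(os.environ["LIVEKIT_URL"])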
Step 2: Define the Modal Application
We define a Modal App with a lightweight Debian-based container image, then install the necessary Python packages.
We also pre-import the libraries that our Modal functions will use inside that image, using the with image.imports() context manager.
from modal import App, Image, Secret, fastapi_endpoint, FunctionCall, Dict
import asyncio

image = Image.debian_slim().pip_install(
    "livekit>=0.19.1",
    "livekit-agents>=0.12.11",
    "livekit-plugins-openai>=0.10.17",
    "livekit-plugins-silero>=0.7.4",
    "livekit-plugins-cartesia==0.4.7",
    "livekit-plugins-deepgram==0.6.19",
    "python-dotenv~=1.0",
    "cartesia==2.0.0a0",
    "fastapi[standard]",
    "aiohttp",
)

app = App("livekit-example", image=image)

# Create a persisted Dict - the data is retained between app runs
room_dict = Dict.from_name("room-dict", create_if_missing=True)

with image.imports():
    from livekit import rtc
    from livekit.agents import AutoSubscribe, JobContext, llm
    from livekit.agents.worker import Worker, WorkerOptions
    from livekit.agents.pipeline import VoicePipelineAgent
    from livekit.plugins import openai, deepgram, silero, cartesia
Step 3: LiveKit Agent Entrypoint
Define the entrypoint function that connects the agent to a LiveKit room:
async def livekit_entrypoint(ctx: JobContext):
    print("Connecting to room", ctx.room.name)
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    participant = await ctx.wait_for_participant()
    run_multimodal_agent(ctx, participant)
This function:
- Connects to a LiveKit room
- Subscribes to audio-only streams
- Waits for a participant to join
- Starts a multimodal AI-powered agent
Step 4: Running the Multimodal AI Agent
Next, we define a multimodal agent that uses Deepgram for speech-to-text (STT), OpenAI's GPT-4o-mini as the large language model (LLM), and Cartesia for text-to-speech (TTS) to process voice interactions. You can also use other LLM, STT, and TTS providers - LiveKit supports a wide range of plugins (see the sketch after the code below).
def run_multimodal_agent(ctx: JobContext, participant: rtc.RemoteParticipant):
    print("Starting multimodal agent")

    initial_ctx = llm.ChatContext().append(
        role="system",
        text="You are a voice assistant created by Modal. You answer questions and help with tasks.",
    )

    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),
        stt=deepgram.STT(model="nova-2-general"),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=cartesia.TTS(),
        chat_ctx=initial_ctx,
    )

    agent.start(ctx.room, participant)
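Swapping providers is just a matter of changing the plugin passed to the corresponding argument. A sketch, assuming the relevant plugin package and API key are set up (here using OpenAI's TTS in place of Cartesia):

agent = VoicePipelineAgent(
    vad=silero.VAD.load(),
    stt=deepgram.STT(model="nova-2-general"),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=openai.TTS(),  # OpenAI TTS instead of cartesia.TTS()
    chat_ctx=initial_ctx,
)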
Step 5: Handle LiveKit Webhook Events on Room Creation and Deletion
LiveKit can be configured to send webhooks upon different events, like when a room is started or finished.
To handle these events, we use Modal's @fastapi_endpoint decorator to create a FastAPI endpoint that listens for them. Upon room creation, we spawn a container in Modal to run the LiveKit worker. Upon room completion, the function call is cancelled and the container is spun down. This means you are only charged while a room is actually open and running.
@app.function(image=image)
@fastapi_endpoint(method="POST")
async def run_livekit_agent(request: dict):
    from fastapi import Response

    room_name = request["room"]["sid"]

    # Check whether this room already has a worker running
    if room_name in room_dict and request["event"] == "room_started":
        print(
            f"Received webhook event for room {room_name} that already has a worker running"
        )
        return Response(status_code=200)

    if request["event"] == "room_started":
        call = run_agent_worker.spawn(room_name)
        room_dict[room_name] = call.object_id
        print(f"Worker for room {room_name} spawned")
    elif request["event"] == "room_finished":
        if room_name in room_dict:
            function_call = FunctionCall.from_id(room_dict[room_name])
            # Spin down the Modal function running the worker
            function_call.cancel()
            # Delete the room from the room_dict
            del room_dict[room_name]
            print(f"Worker for room {room_name} spun down")

    return Response(status_code=200)
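For reference, the handler above only relies on a couple of fields from LiveKit's webhook payload. An illustrative (made-up) example of what it reads looks roughly like this; see LiveKit's webhook documentation for the full schema:

example_event = {
    "event": "room_started",  # or "room_finished"
    "room": {
        "sid": "RM_hypothetical123",  # used as the key into room_dict above
        "name": "my-room",
    },
}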
Step 6: Running the LiveKit Worker
Next, we define a Modal function that runs the LiveKit worker. We specify that we want to run this function (i.e. the LiveKit worker) with a GPU. We also want to handle the case where the worker is cancelled, whereupon it will receive a cancellation signal and clean up.
@app.function(
    gpu="A100", timeout=3000, secrets=[Secret.from_name("livekit-voice-agent")]
)
async def run_agent_worker(room_name: str):
    import os

    print("Running worker")
    worker = Worker(
        WorkerOptions(
            entrypoint_fnc=livekit_entrypoint,
            ws_url=os.environ.get("LIVEKIT_URL"),
        )
    )

    try:
        await worker.run()  # Wait for the worker to finish
    except asyncio.CancelledError:
        print(f"Worker for room {room_name} was cancelled. Cleaning up...")
        raise  # Re-raise to propagate the cancellation
    finally:
        # Perform cleanup before termination
        await worker.drain()
        await worker.aclose()
        print(f"Worker for room {room_name} shutdown complete.")
Step 7: Deploy the Modal App
With all this code in an app.py file, we can deploy both the Modal function and the FastAPI endpoint by running modal deploy app.py.
In stdout, you’ll see the URL of the FastAPI endpoint, which you need to copy and add to the LiveKit dashboard as the webhook URL.
Step 8: Spinning up a LiveKit frontend
LiveKit provides a frontend Sandbox that you can use to test your agent.
Go to the LiveKit dashboard > Sandbox > Voice assistant. You should be able to instantiate a voice assistant frontend sandbox. Since you have deployed your agent with the appropriate LIVEKIT_URL, the frontend sandbox will automatically connect to your agent.
Conclusion
LiveKit Agents allows developers to build real-time voice assistants with minimal effort.
And the best way to deploy is with Modal!