Retrieval-augmented generation (RAG) for question-answering with LangChain
In this example we create a large-language-model (LLM) powered question-answering web endpoint and CLI. Only a single document is used as the knowledge base of the application, the 2022 US State of the Union address by President Joe Biden. However, this same application structure could be extended to do question-answering over all State of the Union speeches, or other large text corpora.
It’s the LangChain library that makes this all so easy. This demo is only around 100 lines of code!
Defining dependencies
The example uses packages to implement scraping, document parsing and LLM API interaction, and web serving.
These are installed into a Debian Slim base image using the uv_pip_install method.
Because OpenAI’s API is used, we also specify the openai-secret Modal Secret, which contains an OpenAI API key.
A global retriever variable is also declared so that a slow operation in the code below can be cached and reused across invocations.
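A minimal sketch of what this setup might look like. The package list and the app name here are illustrative, not an exact manifest:

```python
import modal

image = modal.Image.debian_slim().uv_pip_install(
    "beautifulsoup4",  # HTML parsing for the scraper
    "httpx",  # HTTP client for fetching the transcript
    "langchain",  # chain construction
    "langchain-community",
    "langchain-openai",
    "faiss-cpu",  # vector index over the speech embeddings
    "fastapi[standard]",  # web serving
    "tiktoken",  # tokenization for the OpenAI models
)

app = modal.App(
    name="example-langchain-qanda",
    image=image,
    secrets=[modal.Secret.from_name("openai-secret")],
)

# Cached here so the slow index-building step below runs at most once
# per warm container.
retriever = None
```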
Scraping the speech
It’s super easy to scrape the transcript of Biden’s speech using httpx and BeautifulSoup.
This speech is just one document and it’s relatively short, but it’s enough to demonstrate
the question-answering capability of the LLM chain.
Since we’re fetching from an external server, we use Modal’s built-in Retries to handle transient
network failures or server issues with exponential backoff.
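A sketch of the scraping function; the transcript URL and the paragraph-level extraction are placeholders to adapt:

```python
import httpx
from bs4 import BeautifulSoup

# Placeholder URL for the transcript page.
SPEECH_URL = "https://www.whitehouse.gov/state-of-the-union-2022/"

@app.function(
    retries=modal.Retries(
        max_retries=3,
        backoff_coefficient=2.0,  # exponential backoff between attempts
        initial_delay=1.0,
    )
)
def scrape_speech() -> str:
    response = httpx.get(SPEECH_URL, follow_redirects=True)
    response.raise_for_status()  # fail (and so retry) on 4xx/5xx responses
    soup = BeautifulSoup(response.text, "html.parser")
    # Grab all paragraph text; a real scraper would target the element
    # that contains the transcript specifically.
    return "\n".join(p.get_text() for p in soup.find_all("p"))
```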
Constructing the Q&A chain
At a high level, this LLM chain will be able to answer questions asked about Biden’s speech and provide references to the parts of the speech that contain the evidence for its answers.
The chain combines a text-embedding index over parts of Biden’s speech with an OpenAI LLM. The index is used to select the most likely relevant parts of the speech given the question, and these are used to build a specialized prompt for the OpenAI language model.
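A sketch of such a chain, using FAISS for the embedding index and LangChain’s RetrievalQAWithSourcesChain. Import paths and chain APIs shift between LangChain versions, so treat this as indicative rather than exact:

```python
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

def qanda_langchain(query: str, speech_text: str) -> tuple[str, list[str]]:
    global retriever
    if retriever is None:
        # Split the speech into overlapping chunks small enough to embed.
        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        texts = splitter.split_text(speech_text)

        # Embed each chunk and build a similarity-search index over them,
        # tagging each chunk with a source ID so answers can cite evidence.
        # This is the slow step cached in the global declared earlier.
        docsearch = FAISS.from_texts(
            texts,
            OpenAIEmbeddings(),
            metadatas=[{"source": str(i)} for i in range(len(texts))],
        )
        retriever = docsearch.as_retriever()

    # Retrieve the chunks most relevant to the query and "stuff" them
    # into a specialized prompt for the OpenAI model.
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        ChatOpenAI(temperature=0),
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
    )
    result = chain.invoke({"question": query})
    sources = [doc.page_content for doc in result["source_documents"]]
    return result["answer"], sources
```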
Mapping onto Modal
With our application’s functionality implemented, we can hook it into Modal. As mentioned above, we’re implementing a web endpoint, web, and a CLI command, cli.
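A sketch of both entrypoints, reusing the hypothetical scrape_speech and qanda_langchain helpers from the sketches above:

```python
@app.function()
@modal.fastapi_endpoint(method="GET")
def web(query: str, show_sources: bool = False):
    answer, sources = qanda_langchain(query, scrape_speech.local())
    if show_sources:
        return {"answer": answer, "sources": sources}
    return {"answer": answer}

@app.function()
def cli(query: str, show_sources: bool = False):
    answer, sources = qanda_langchain(query, scrape_speech.local())
    print(answer)
    if show_sources:
        print("SOURCES:")
        for source in sources:
            print(source)
```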
Test run the CLI
To see the text of the sources the model chain used to provide the answer, set the --show-sources flag.
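A test run might look like the following; the query is just illustrative, and ::cli selects the CLI function from the file:

```shell
modal run potus_speech_qanda.py::cli \
  --query "What did the president say about Justice Breyer?" \
  --show-sources
```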
Test run the web endpoint
Modal makes it trivial to ship LangChain chains to the web. We can test drive this app’s web endpoint by running modal serve potus_speech_qanda.py and then hitting the endpoint with curl.
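A request might look like this; the endpoint URL is a placeholder, since modal serve prints the real URL for your workspace when it starts:

```shell
curl --get \
  --data-urlencode "query=What did the president say about Justice Breyer?" \
  https://your-workspace--example-langchain-qanda-web.modal.run
```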
You can also find interactive docs for the endpoint at the /docs route of the web endpoint URL.
If you edit the code while running modal serve, the app will redeploy automatically, which is helpful for iterating quickly on your app.
Once you’re ready to deploy to production, use modal deploy.
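Assuming the same filename as above, that’s:

```shell
modal deploy potus_speech_qanda.py
```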