Cutting-edge platforms like Contextual AI often find that their software development practices require more flexible resources than legacy providers can offer. With Modal, Contextual AI was able to automate and parallelize their continuous integration (CI) on GPUs.
About Contextual AI
Contextual AI offers an end-to-end platform for building RAG 2.0 (retrieval-augmented generation) enterprise AI applications. The product integrates the entire RAG pipeline into a single optimized system which can be specialized for customer needs, delivering greater accuracy and transparency for knowledge-intensive tasks. The company is led by CEO Douwe Kiela, who pioneered the industry-standard RAG technique, and CTO Amanpreet Singh, who was a research engineer at Hugging Face and Meta’s Fundamental AI Research team.
A bottleneck on testing
CI is a practice where engineers integrate their code changes frequently, and each integration is verified by an automated build and automated tests. Because Contextual AI’s product uses LLMs, they needed a way to run CI on GPUs. There were two scenarios in which they ran test suites:
- Before a pull request (PR) was merged, they would run a large suite of small tests to ensure that the PR didn’t break any plumbing in the product. To optimize for efficiency, they used tiny, several-MB models as stand-ins.
- Once a day, they would run more in-depth “quality” tests using larger models that customers would actually use, to ensure there were no regressions in model output.
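Contextual AI’s exact test layout isn’t described here, but a common way to implement this kind of split is with pytest markers: tag each test as belonging to the fast pre-merge suite or the daily quality suite, then have each CI job select its marker (for example, `pytest -m plumbing` before merge and `pytest -m quality` for the daily run). A minimal, hypothetical `conftest.py` sketch:

```python
# conftest.py (hypothetical sketch, not Contextual AI's actual configuration):
# register the two markers so each CI job can select its own suite.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "plumbing: fast pre-merge tests that use tiny stand-in models"
    )
    config.addinivalue_line(
        "markers", "quality: slower daily tests that use production-scale models"
    )
```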
Developers had to run these tests manually on in-house GPU nodes, which was inconvenient and time-consuming. It was easy to forget to run the tests before merging PRs, resulting in broken master code that would slow down the whole team.
Another pain point was procuring GPUs on demand. While Contextual AI had a massive quantity of GPUs reserved with GCP, the research team’s training and prototyping needs took priority. It didn’t make sense for CI to divert resources away from them, which is why Stas Bekman, an ML engineer at Contextual AI, wanted to find a reliable external provider.
Stas searched for CI-on-GPU options, but didn’t find a good fit. Their CI required at least two GPUs, yet neither GitHub nor CircleCI provided more than one GPU per job, and the GPUs those services did offer were old, slow, and expensive.
During his time at Hugging Face, Stas had used an AWS on-demand GPU instance to solve this problem, but it wasn’t ideal. Updating the machine image was slow and cumbersome, and it could take 5+ minutes just to get an instance running. Often, CI would fail because no instance could be found, even when he searched across multiple availability zones. He wanted to avoid repeating the same mistake at Contextual AI.
Parallelizable CI on Modal GPUs
After asking for suggestions on Twitter, Stas decided to try Modal because it offered flexible configurations of GPUs on demand. This is what the CI workflow looked like:
- PR is submitted on GitHub.
- A GitHub Action is triggered, which calls a Modal Function. The Function has multiple GPUs attached and uses an image with custom requirements and `pytest` installed.
- The Modal Function invokes `pytest` as a subprocess to run a suite of tests.
- The first time the Function runs, Modal builds and caches the custom image. On subsequent runs, no image rebuild is needed, allowing the tests to start running within 30 seconds of job submission.
Simplified pattern of CI using Modal:
```python
import modal

image = (
    modal.Image.debian_slim()
    .pip_install("pytest")
    .pip_install_from_requirements("requirements.txt")
)

# Local test files mounted into the container (definition assumed; the
# original snippet referenced `tests` without showing it).
tests = modal.Mount.from_local_dir("tests", remote_path="/root/tests")

app = modal.App("ci-testing", image=image)

@app.function(gpu="any", mounts=[tests])
def pytest():
    import subprocess

    # Run the mounted test suite inside the GPU-backed container.
    subprocess.run(["pytest", "-vs"], check=True, cwd="/root")
```
This workflow allowed Contextual AI to fully automate their test suite. As a result, they can maximize their developer iteration speed while maintaining a high quality bar. Other key benefits:
- GitHub Actions can directly trigger Modal, so there’s no need to manage self-hosted runners.
- Modal spins up GPUs for each job submission, allowing CI for multiple PRs to run in parallel.
- Modal bills by usage, which keeps costs low. Because image builds are cached, 99% of what’s billed is actual test run-time.
All of this has been enabled by Modal’s custom infrastructure—including our own file system and scheduler—for running containers in the cloud. Modal can spin up GPU-enabled containers in as little as one second, which helps companies iterate fast and scale up to large production workloads.
Interested in CI on Modal? Check out our sample repo.