September 18, 2024 · 4 minute read
How Contextual AI automated CI with Modal GPUs

Cutting-edge platforms like Contextual AI often find that their software development practices require more flexible resources than legacy providers can offer. With Modal, Contextual AI was able to automate and parallelize their continuous integration (CI) on GPUs.

About Contextual AI

Contextual AI offers an end-to-end platform for building RAG 2.0 (retrieval-augmented generation) enterprise AI applications. The product integrates the entire RAG pipeline into a single optimized system which can be specialized for customer needs, delivering greater accuracy and transparency for knowledge-intensive tasks. The company is led by CEO Douwe Kiela, who pioneered the industry-standard RAG technique, and CTO Amanpreet Singh, who was a research engineer at Hugging Face and Meta’s Fundamental AI Research team.

A bottleneck on testing

CI is a practice where engineers integrate their code changes frequently, and each integration is verified by an automated build and automated tests. Because Contextual AI’s product uses LLMs, they needed a way to run CI using GPUs. They ran test suites in two scenarios:

  1. Before a pull request (PR) was merged, they would run a large suite of small tests to ensure that the PR didn’t break any plumbing in the product. To optimize for efficiency, they used tiny, several-MB models as stand-ins.
  2. Once a day, they would run more in-depth “quality” tests using larger models that customers would actually use, to ensure there were no regressions in model output (one way such a split can be organized is sketched after this list).
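
A minimal sketch of that kind of split, assuming a pytest marker and an environment variable select the model; the names here (CI_MODEL, the quality marker, the stand-in model name) are illustrative assumptions, not Contextual AI’s actual test code:

import os

import pytest

# The fast pre-merge job leaves CI_MODEL unset and gets a tiny stand-in
# checkpoint; the daily quality job points it at a production-sized model.
MODEL_NAME = os.environ.get("CI_MODEL", "tiny-random-model")

def test_plumbing():
    # Pre-merge "plumbing" test: only checks that the pipeline is wired up,
    # so a several-MB stand-in model is good enough.
    assert MODEL_NAME

@pytest.mark.quality  # custom marker; daily run selects it with: pytest -m quality
def test_no_output_regressions():
    # In-depth quality test against a model customers actually use.
    assert MODEL_NAME != "tiny-random-model"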

Developers had to run these tests manually on in-house GPU nodes, which was inconvenient and time-consuming. It was easy to forget to run the tests before merging PRs, resulting in a broken master branch that slowed down the whole team.

Previously you were just trusting that people would trigger these GPU tests manually before they merged code. I would have to ask, “Well, did you run the tests?” before I approved the PRs. But now we’re 50 people and you can’t rely on that.
Stas Bekman
ML Engineer at Contextual AI

Another pain point was procuring GPUs on demand. While Contextual AI had a massive quantity of GPUs reserved with GCP, the research team’s training and prototyping needs took priority. It didn’t make sense for CI to divert resources away from them, which is why Stas Bekman, an ML engineer at Contextual AI, wanted to find a reliable external provider.

Stas searched for CI-on-GPU options but didn’t find a good fit. Their CI required at least two GPUs, yet neither GitHub nor CircleCI provided more than one GPU per job, and the GPUs those providers did offer were old, slow, and expensive.

During his time at Hugging Face, Stas had used an AWS on-demand GPU instance to solve this problem, but it wasn’t ideal. Updating the machine image was slow and cumbersome, and it could take 5+ minutes just to get an instance running. Often, CI would fail because no instance could be found, even when he searched across multiple availability zones. He wanted to avoid repeating the same mistake at Contextual AI.

Parallelizable CI on Modal GPUs

After asking for suggestions on Twitter, Stas decided to try Modal because it offered flexible configurations of GPUs on demand. This is what the CI workflow looked like:

  1. PR is submitted on GitHub.
  2. A GitHub Action is triggered, which calls a Modal Function. The Function has multiple GPUs attached and uses an image with the project’s requirements and pytest installed.
  3. The Modal Function invokes pytest as a subprocess to run a suite of tests.
  4. The first time the Function runs, Modal builds and caches the custom image. On subsequent runs, no image rebuild is needed, allowing the tests to start running within 30 seconds of job submission.

Simplified pattern of CI using Modal:

import modal

# Container image with pytest and the project's requirements preinstalled.
image = (
    modal.Image.debian_slim()
    .pip_install("pytest")
    .pip_install_from_requirements("requirements.txt")
)

# Make the local test suite available inside the container
# (the local and remote paths shown here are illustrative).
tests = modal.Mount.from_local_dir("tests", remote_path="/root/tests")

app = modal.App("ci-testing", image=image)

@app.function(gpu="any", mounts=[tests])  # a specific GPU type and count can also be requested
def pytest():
    import subprocess

    # Run the mounted test suite; a non-zero exit code fails the Function,
    # and therefore the CI job.
    subprocess.run(["pytest", "-vs"], check=True, cwd="/root")
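
On the GitHub Actions side, the job only needs the Modal client installed and Modal token secrets configured; a step along the lines of modal run ci_tests.py::pytest (the file name here is an illustrative assumption) then starts the Function and streams the test output back into the Action’s logs.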

This workflow allowed Contextual AI to fully automate their test suite. As a result, they can maximize their developer iteration speed while maintaining a high quality bar. Other key benefits:

  • GitHub Actions can directly trigger Modal, so there’s no need to manage self-hosted runners.
  • Modal spins up GPUs for each job submission, allowing CI for multiple PRs to run in parallel.
  • Modal bills by usage, which keeps costs low. Because image builds are cached, 99% of what’s billed is actual test run-time.

I was shocked at the amazing support I received from Modal's team. They quickly created a sample repo that catered exactly to our needs and within a few hours we had our CI running. In this day and age it's very difficult to find excellent technical support within seconds of posting a request. It has been an amazing experience for our team collaborating with Modal.
Stas Bekman
ML Engineer at Contextual AI

All of this has been enabled by Modal’s custom infrastructure—including our own file system and scheduler—for running containers in the cloud. Modal can spin up GPU-enabled containers in as little as one second, which helps companies iterate fast and scale up to large production workloads.

Interested in CI on Modal? Check out our sample repo.
