What is the skill ceiling for prompting? The team at Basis has been evaluating automatic prompt optimization tools and seeing promising results. But even when automatically discovered prompts outperformed their hand-designed ones, that nagging question remained.
To find out, they organized the Prompt Olympics, an event that brought the culture of competitive programming to prompt engineering. What better way to find the peaks of human performance than some friendly competition — and a $5,000 prize?
At Modal, we love programming competitions (we hired IOI medalists before it was cool), so we jumped at the chance to provide them the infrastructure they needed to run the Prompt Olympics.
In this article, we walk through two components of the competition, explaining both the prompting challenges and how Basis deployed them on Modal.
Challenge: Writing a prompt to write code
In one challenge, participants were asked to do something familiar to any competitive programmer — write code that passes tests. But instead of writing it themselves, they had to write prompts to control language models writing code.
The underlying LLM for this challenge, LLaMA 3 70B, was deployed on Modal, which provides serverless GPU acceleration and simple web deployment for Python programs. If you want to deploy your own LLM on Modal, check out this guide.
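For readers who haven't deployed a model on Modal before, here is a minimal sketch of what that can look like, using vLLM to serve the model from a Modal class. The GPU count, Secret name, and sampling settings are illustrative assumptions, not the exact configuration used for the competition.

```python
import modal

app = modal.App("prompt-olympics-llm")

# Container image with vLLM installed; model weights are downloaded on first start.
image = modal.Image.debian_slim(python_version="3.11").pip_install("vllm")


@app.cls(
    image=image,
    gpu="A100-80GB:4",  # illustrative: enough memory for a 70B model in fp16
    secrets=[modal.Secret.from_name("huggingface")],  # hypothetical Secret with an HF token for the gated weights
)
class Llama70B:
    @modal.enter()
    def load(self):
        # Runs once per container, so the weights stay loaded across requests.
        from vllm import LLM

        self.llm = LLM(
            model="meta-llama/Meta-Llama-3-70B-Instruct",
            tensor_parallel_size=4,
        )

    @modal.method()
    def generate(self, prompt: str) -> str:
        from vllm import SamplingParams

        outputs = self.llm.generate([prompt], SamplingParams(max_tokens=512))
        return outputs[0].outputs[0].text


@app.local_entrypoint()
def main():
    # Calls the remote GPU container; the same method could also sit behind a web endpoint.
    print(Llama70B().generate.remote("Write a function that reverses a string."))
```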
To grade this challenge, the LLM-produced code needed to be executed and tested. But this is a competitive environment and the code is untrusted. How do you handle a foolish LLM (or an angry player) generating code that runs sudo rm -rf / or unzips a zip bomb?
Modal provides just the right tool for this: a Modal Sandbox is a dynamically allocated secure container with all the features of normal Modal Functions, from controlled access to Secrets and custom environments to GPU acceleration. For more on running untrusted LLM code with Modal Sandboxes, see this video from LangChain.
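Here is a hedged sketch of how LLM-generated code might be run and graded inside a Sandbox. The resource limits and the grading harness are our own illustration of the pattern, not the exact setup Basis used.

```python
import modal

app = modal.App.lookup("prompt-olympics-grader", create_if_missing=True)


def run_untrusted(code: str) -> tuple[int, str]:
    """Execute untrusted, LLM-generated code inside an isolated Modal Sandbox."""
    sandbox = modal.Sandbox.create(
        app=app,
        image=modal.Image.debian_slim(),
        timeout=30,  # hard wall-clock limit, so infinite loops can't stall grading
        cpu=1,
        memory=512,  # MiB: a zip bomb just kills this container, nothing else
    )
    try:
        proc = sandbox.exec("python", "-c", code)
        proc.wait()
        return proc.returncode, proc.stdout.read()
    finally:
        sandbox.terminate()  # always tear the container down


if __name__ == "__main__":
    # Even sudo rm -rf / only damages a throwaway filesystem.
    returncode, output = run_untrusted("print('passed the tests!')")
    print(returncode, output)
```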
Finale: Playing Crafter
The winner of the event was decided by playing a video game. But instead of controlling the character directly, participants had to construct a prompt for a language model that controlled the character. The video game chosen was Crafter, a simplified two-dimensional version of Minecraft designed for use in agent research. The video above shows the best run, where an agent successfully creates a pickaxe and mines coal while dodging enemies.
This challenge resembled a different type of competitive programming. In some competitions, participants are challenged not to pass test cases or minimize latency but to write programs that control virtual agents, like simulated delivery drones or tanks.
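Crafter exposes a gym-style interface, which makes the control loop easy to picture. Below is a sketch of that loop with a random-action placeholder standing in where the prompted language model would choose what to do; the observation handling and scoring here are simplified assumptions.

```python
import random

import crafter


def choose_action(system_prompt: str, step: int, n_actions: int) -> int:
    # Placeholder: in the competition, this is where the participant's prompt
    # and a description of the current game state would be sent to the LLM.
    return random.randrange(n_actions)


def play_episode(system_prompt: str, max_steps: int = 1000) -> float:
    env = crafter.Env()  # simplified 2D Minecraft-like world
    env.reset()
    total_reward, done, step = 0.0, False, 0
    while not done and step < max_steps:
        action = choose_action(system_prompt, step, env.action_space.n)
        _, reward, done, _ = env.step(action)
        total_reward += reward  # Crafter rewards achievements like mining coal
        step += 1
    return total_reward


if __name__ == "__main__":
    print(play_episode("You are playing Crafter. Survive and craft tools."))
```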
For this challenge, the team at Basis found that even the most capable open source models, like LLaMA 3 70B, were difficult to prompt effectively — though they were able to run those models on Modal. So they switched instead to a proprietary model API service for the LLM component.
But they still used Modal! In addition to running a client for the external model API, the Basis team ran the Crafter game itself on Modal. Modal provides general-purpose computing infrastructure with serverless semantics and an easy-to-use Pythonic SDK, not just the tools you need for LLM inference.
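Putting those pieces together, the game server might look roughly like the sketch below: a Modal Function that runs one Crafter episode and asks an external model API for each action. The provider isn't named here, so the OpenAI-style client, the Secret name, and the model name are placeholder assumptions.

```python
import modal

app = modal.App("prompt-olympics-crafter")

image = modal.Image.debian_slim().pip_install("crafter", "openai")


@app.function(
    image=image,
    secrets=[modal.Secret.from_name("llm-api-key")],  # hypothetical Secret, assumed to set OPENAI_API_KEY
    timeout=600,
)
def evaluate_prompt(system_prompt: str, max_steps: int = 500) -> float:
    """Run one Crafter episode, letting an external model pick every action."""
    import crafter
    from openai import OpenAI

    client = OpenAI()  # reads the API key injected by the attached Secret
    env = crafter.Env()
    env.reset()
    total_reward, done, step = 0.0, False, 0
    while not done and step < max_steps:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # stand-in for whichever proprietary model was used
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Step {step}. Reply with only an action index between 0 and {env.action_space.n - 1}."},
            ],
        )
        try:
            action = int(response.choices[0].message.content.strip()) % env.action_space.n
        except ValueError:
            action = 0  # fall back to a no-op if the model replies with prose
        _, reward, done, _ = env.step(action)
        total_reward += reward
        step += 1
    return total_reward


@app.local_entrypoint()
def main():
    score = evaluate_prompt.remote("You are playing Crafter. Craft a pickaxe and mine coal.")
    print(f"episode reward: {score}")
```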
You can read more about the competition, including the results and the winning prompts, on the Basis blog.
We were excited to provide the infrastructure that helped the Basis team bring the spirit of competitive programming to prompting. Much of the team at Modal learned to program through competitions. If you did too, know that we’re always hiring!