August 5, 2024 · 10 minute read
Beat GPT-4o at Python by searching with 100 dumb LLaMAs

One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

Richard Sutton, The Bitter Lesson

The eponymously distasteful take-away of Richard Sutton’s essay has often been misconstrued: because scale is all you need, they say, smaller models are doomed to irrelevance. The rapid increase in model size above one trillion parameters and the technological limitations of GPU memory together seemed to foreclose on economical frontier intelligence anywhere except at an oligopoly of intelligence-as-a-service providers. Open models and self-serve inference were in retreat.

But as the quote above indicates, there are in fact two arrows in the scaling quiver: learning and search. Learning, as we do it now with neural networks, scales with memory at inference time — larger models perform better, ceteris paribus, because they can distill more of their training data into more circuits and more templates. Search scales smoothly with compute at inference time — compute that can be spent either on producing higher-quality candidates or on producing more candidates. In the ideal case, the scaling behavior can be predicted via so-called scaling laws.

Recent papers indicate that generative models like LLMs can be scaled up with search. The Large Language Monkeys paper, published on arXiv by Brown, Juravsky, and co-authors last week, includes several results in this vein and indicates that frontier-level intelligence in certain domains can be elicited from smaller models that can run on a single, past-generation GPU. Further, they observed smooth, predictable improvement of performance with scale.

Put more simply: where before, it seemed frontier capabilities required one horse-sized duck, it is clear we can now alternatively get them with one hundred duck-sized horses (or, rather, LLaMAs).

This weekend, we set out to replicate this finding.

Scaling LLaMA 3.1 8B on HumanEval with Modal

Running all of our experiments, including configuration and testing, cost well under $50.

You can find our code here. You can run it yourself without exceeding the $30/month in credits included in Modal’s free tier.

Metrics and data: HumanEval and pass@k

We picked a dataset that was not covered in the Large Language Monkeys paper: HumanEval, a somewhat misleadingly-named dataset of Python function specifications and their tests from OpenAI.
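
HumanEval is small (164 problems) and easy to pull down. As a rough sketch, here is one way to load it, using the Hugging Face datasets package; our harness isn't obliged to load it this way, but the field names below are those of the published dataset:

```python
# Sketch: load HumanEval. Each problem carries a function signature plus
# docstring ("prompt"), hidden tests ("test"), and the name of the
# function under test ("entry_point").
from datasets import load_dataset

problems = load_dataset("openai_humaneval", split="test")  # 164 problems
problem = problems[0]
print(problem["task_id"])      # e.g. "HumanEval/0"
print(problem["prompt"])       # the specification the model must complete
print(problem["entry_point"])  # the function name the tests exercise
```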

The existence of these tests is crucial to enabling search. Any candidate solution can be evaluated by running it against the tests — no humans are required to evaluate HumanEval. That means correctness can be assessed objectively, with none of the issues that bedevil LLM-as-judge approaches. The Large Language Monkeys paper further indicates that majority-voting and other techniques fall off their scaling laws quickly.
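
Concretely, judging a candidate amounts to concatenating the problem's prompt, the model's completion, and the problem's own test block into a standalone program, then running it. A minimal sketch (the helper name is ours, for illustration):

```python
# Sketch: assemble one candidate completion into a runnable test program.
def build_test_program(problem: dict, completion: str) -> str:
    return (
        problem["prompt"]      # function signature + docstring
        + completion           # model-generated function body
        + "\n\n"
        + problem["test"]      # defines `def check(candidate): ...` with asserts
        + f"\ncheck({problem['entry_point']})\n"  # run the asserts
    )
# Exit code 0 means the candidate passed; a failed assert (or a syntax
# error) means it did not. We run these programs in isolation, as
# described below.
```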

We set out to demonstrate that by running LLaMA 3.1 8B Instruct many times, we could match or exceed GPT-4o’s performance on HumanEval. Performance is measured via the “pass@k” metric: the chance that out of the k programs produced by the LLM, at least one will pass the tests. We also consider “fail@k”, the chance that no program passes the tests, which is always 1 - pass@k. Result aggregator PapersWithCode reports GPT-4o’s pass@1 performance as 90.2% (0-shot, taken from Claude 3.5 Sonnet evals), so that was our target.
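
For reference, pass@k is usually computed with the unbiased estimator from the original HumanEval paper rather than by literally drawing k-sample subsets: given n generations per problem, of which c pass, the estimator looks like this.

```python
# Sketch: the standard unbiased pass@k estimator (Chen et al., 2021).
# n = total generations per problem, c = how many of them passed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that a random k-subset of the n generations
    # contains at least one of the c passing generations:
    # 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def fail_at_k(n: int, c: int, k: int) -> float:
    return 1.0 - pass_at_k(n, c, k)  # fail@k is always 1 - pass@k
```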

Infrastructure: LLM inference

We ran our experiments with Modal’s serverless GPUs. Smaller models are generally more compatible with a serverless approach because they can be more rapidly loaded from remote storage or disk. The arithmetic throughput of GPUs is many orders of magnitude greater than the read throughput of disks (an H100 delivers on the order of a petaFLOP per second of compute, while even fast storage reads top out at a few GB/s), so it is in general a good idea to trade more computing time for less loading time. Of course, that means you want to make sure that setup time is as fast as possible, as we’ve done at Modal by rewriting the container stack.

Our experiments were enabled by the open source vLLM inference server software. Follow-up on promising initial research into scaling out search with LLMs last year was slowed by the need to implement performant caching mechanisms. These mechanisms are now a standard part of inference servers, pioneered by vLLM. Caching ensures that token sequences that are repeatedly processed (like the prompt whose solution is being searched for) incur only constant cost with respect to search scale. Executing batch inference was as simple as changing the n parameter in our generic OpenAI-compatible client’s ChatCompletion requests to a vLLM server running in OpenAI-compatible mode on Modal. Check out this guide to running vLLM on Modal for more details.
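
For instance, a request for 100 candidates looks roughly like this; the base_url, model name, and sampling parameters below are illustrative placeholders, not the exact values from our deployment.

```python
# Sketch: sample many candidates in one OpenAI-compatible request by
# setting `n`. URL, model name, and sampling parameters are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace--your-vllm-app.modal.run/v1",  # your vLLM server on Modal
    api_key="EMPTY",  # or whatever token your server expects
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": problem["prompt"]}],  # the HumanEval prompt
    n=100,            # number of candidate completions to sample
    temperature=0.8,  # nonzero temperature so the candidates differ
    max_tokens=512,
)
completions = [choice.message.content for choice in response.choices]
```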

We scaled up to ten A100-40GB GPUs and hit ~40,000 output tokens per second without particular attention to tuning — a decided benefit of vLLM over other (nominally more performant) inference servers. This scale is compatible with Modal’s free tier, but enterprises running on Modal can easily scale at least two orders of magnitude higher, or 4,000,000 output tokens/second. With our new reduced prices, that’d cost roughly $0.25 per million output tokens, competitive with dedicated inference-as-a-service providers — plus greater control over your deployment.

Infrastructure: Evaluation

Evaluating the model’s output requires executing arbitrary Python code, which means we need a technique for secure isolation. That would be a tricky proposition for a platform that offers inference-as-a-service or serverless GPUs alone. Good thing we have Modal Sandboxes! Sandboxes use the same fast-booting, secure containerization technology that powers the rest of Modal, but provide a simple interface for dynamic creation and teardown in the middle of program execution.
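
A rough sketch of that pattern, assuming a recent modal client; the app name, image, and timeout are illustrative, and program is a candidate-plus-tests string like the one assembled earlier:

```python
# Sketch: execute one candidate's test program inside a Modal Sandbox.
import modal

app = modal.App.lookup("humaneval-evals", create_if_missing=True)  # illustrative name

sandbox = modal.Sandbox.create(
    "python", "-c", program,          # run the candidate + its tests
    app=app,
    image=modal.Image.debian_slim(),  # plain Python suffices for HumanEval's tests
    timeout=60,                       # kill non-terminating candidates
)
sandbox.wait()                        # block until the program exits
passed = sandbox.returncode == 0      # exit code 0 means every assert passed
```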

Again restricting ourselves to the concurrency limits of Modal’s free tier, we were able to run ~3,000 tests in parallel (32 workers per node on 100 nodes). This was more than sufficient for our needs, so we didn’t press further on scaling evaluation.
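
Sandbox creation does not block on the program finishing, so fanning out is just a loop. A simplified sketch of the idea (our actual harness ran 32 workers per node across 100 nodes rather than this naive serial fan-out):

```python
# Sketch: naive fan-out of candidate evaluation across many Sandboxes.
# `programs` is a list of candidate-plus-tests strings; `app` is as above.
sandboxes = [
    modal.Sandbox.create("python", "-c", prog, app=app, timeout=60)
    for prog in programs
]

results = []
for sb in sandboxes:
    sb.wait()                      # block until this sandbox's program exits
    results.append(sb.returncode == 0)

pass_count = sum(results)          # the `c` fed into the pass@k estimator above
```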

Matching and exceeding GPT-4o’s performance

We were able to replicate the core results of the Large Language Monkeys paper with a new model (they used the LLaMA 3 series, we used LLaMA 3.1) and a new dataset (they showed results for math datasets like GSM8K and for the software engineering dataset SWE-bench, we used HumanEval).

Specifically, we found that (with minimal prompt tuning and no tuning of other hyperparameters) we could boost the performance of LLaMA 3.1 8B from 66.4% with a single generation to parity with GPT-4o at 100 generations (90.5% versus 90.2%) and clearly superior performance at 1000 (95.1%).

Figure: LLaMA 3.1 8B pass@k on HumanEval, for k from 1 to 1000, versus GPT-4o's reported pass@1 performance; pass@k exceeds GPT-4o's pass@1 at 100 or more samples.

We also found that our results on HumanEval were smoothly predictable (“enjoy scaling laws”) across three orders of magnitude. We prefer the following presentation, which inverts “pass@k” to “fail@k” and logarithmically transforms both axes.

Figure: LLaMA 3.1 8B fail@k on HumanEval, for k from 1 to 1000, versus GPT-4o's reported fail@1 performance; the scaling is smoothly log-linear across three orders of magnitude. Both axes are log-transformed.
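
A straight line on that log-log plot corresponds to a power law, fail@k ≈ A * k^(-b). As a back-of-the-envelope check, here is a fit through just the three points quoted above (the full analysis uses every k from 1 to 1000):

```python
# Sketch: fit a power law fail@k ~ A * k**(-b) through the three pass@k
# figures quoted in this post (pass@1 = 0.664, pass@100 = 0.905,
# pass@1000 = 0.951). A line in log-log space is exactly such a power law.
import numpy as np

k = np.array([1, 100, 1000])
fail = 1 - np.array([0.664, 0.905, 0.951])

slope, intercept = np.polyfit(np.log10(k), np.log10(fail), deg=1)
print(f"fail@k ≈ {10**intercept:.2f} * k^({slope:+.2f})")
```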

Note that these results are distinct from a “replication” of the original paper’s results in the strict sense. Instead, these are a replication of the core claim, which is that, when augmented with search, smaller models can outperform bigger ones in a predictable way. We consider that a stronger signal for the industrial relevance of the underlying work than replication sensu stricto.

What’s next?

Search is a powerful technique for improving intelligent systems that has been relatively under-appreciated in this past decade focused on (deep) learning.

Search is powerful precisely because it can be transparently scaled. In the words of Richard Sutton, search “continue[s] to scale with increased computation even as the available computation becomes very great”. Modal is designed to make the available computation very great indeed. Search also shifts the balance of resource consumption from memory to compute, which has, due to semiconductor trends, historically been a winning move. Not coincidentally, it favors Modal’s serverless execution model.

Search is enabled by high quality evaluation of outcomes. Impressive recent results in mathematics, like DeepMind’s AlphaProof and AlphaGeometry 2 getting a silver medal in the 2024 International Math Olympiad, have been enabled by the translation of informal natural language mathematical problems into formal statements in Lean, which enables their detailed supervision by a proof verifier/compiler. The increased parallelization of mathematical work made possible by formalization also played a role in the recent verification that the fifth Busy Beaver number is 47,176,870.

By the Curry-Howard-Lambek correspondence, mathematical proofs can be identified with computer programs. We can expect similar gains in the use of generative models in programming by pairing them with compilers and test suites, as in our small experiment and in the original paper’s experiments on SWE-bench.

The extension of this technique to domains outside of mathematics and programming is not obvious — how do you effectively search over open-ended natural language responses to “write an email to my insurer contesting this claim denial” or “summarize this email”? But we can loosely expect that generative models will see performance gains in a domain in proportion to that domain’s ability to precisely specify outcomes, measure them quickly, and thence search over them. Agents in repeatable digital environments seem like a good frontier to target.

From this point of view, search is downstream of evaluation. Hence the claim that many AI engineers are making: evaluation is the missing ingredient in the productionization and continual improvement of generative model applications.
