AI evals are becoming the new compute bottleneck

Back when Stanford’s CRFM dropped HELM in 2022, the cost of evaluating a single model ranged from $85 for OpenAI’s code-cushman-001 to $10,926 for AI21’s J1-Jumbo. That was for static benchmarks—predicting labels on a fixed set of examples. Fast forward to 2026, and the Holistic Agent Leaderboard (HAL) just spent about $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can hit $2,829 before caching. The cost problem has shifted from training to evaluation, and it’s now a real bottleneck.

The static benchmark cost story

The HELM paper’s own accounting showed GPU-hours ranging from 540 to 4,200 for open models, with BLOOM and OPT at the top end. IBM Research noted that putting Granite-13B through HELM “can consume as many as 1,000 GPU hours.” Across HELM’s 30 models and 42 scenarios, the aggregate cost came to roughly $100,000. That’s not trivial, but it’s manageable for a well-funded lab.

Perlitz et al. looked at EleutherAI’s Pythia checkpoints: 154 checkpoints for each of 16 models across 8 sizes, or 2,464 checkpoints total. Running the LM Evaluation Harness across all of them turns eval into a multiplier on training. For small models, evaluation becomes the dominant compute line item across the whole development cycle. The same multiplier shows up at inference: when you scale inference-time compute, you scale evaluation costs with it.

Perlitz et al. then asked how much of HELM actually carried the rankings. The result was striking: a 100× to 200× reduction in compute preserved nearly the same ordering. Flash-HELM turned that into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM’s compute was confirming rankings that the field could have inferred much more cheaply.
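The coarse-to-fine idea can be sketched in a few lines. This is a generic successive-pruning loop, not Flash-HELM’s actual procedure: the function name `coarse_to_fine`, the `evaluate` callback, the budgets, the `keep_fraction` parameter, and the made-up scores are all assumptions for illustration.

```python
from typing import Callable, Sequence


def coarse_to_fine(
    models: Sequence[str],
    evaluate: Callable[[str, int], float],
    budgets: Sequence[int] = (50, 500, 5000),
    keep_fraction: float = 0.5,
) -> list[tuple[str, float, int]]:
    """Rank models, spending larger example budgets only on survivors.

    Returns (model, score, budget_used) tuples, best first. Models that
    survived to a larger budget are ranked above models pruned earlier.
    """
    survivors = list(models)
    results: dict[str, tuple[float, int]] = {}
    for stage, budget in enumerate(budgets):
        # Re-evaluate every surviving model at the current resolution.
        for m in survivors:
            results[m] = (evaluate(m, budget), budget)
        survivors.sort(key=lambda m: results[m][0], reverse=True)
        # Prune after every stage except the last one.
        if stage < len(budgets) - 1:
            keep = max(1, int(len(survivors) * keep_fraction))
            survivors = survivors[:keep]
    ranked = sorted(
        results.items(), key=lambda kv: (kv[1][1], kv[1][0]), reverse=True
    )
    return [(m, score, budget) for m, (score, budget) in ranked]


# Hypothetical models with made-up, noise-free scores.
true_quality = {"a": 0.81, "b": 0.62, "c": 0.78, "d": 0.40}
calls: list[tuple[str, int]] = []


def fake_evaluate(model: str, n_examples: int) -> float:
    calls.append((model, n_examples))
    return true_quality[model]


ranking = coarse_to_fine(
    list(true_quality), fake_evaluate, budgets=(10, 100), keep_fraction=0.5
)
examples_spent = sum(n for _, n in calls)
full_cost = len(true_quality) * 100
print([m for m, _, _ in ranking], examples_spent, full_cost)
```

In this toy run the two weakest models are only ever scored on 10 examples, so the sweep spends 240 examples instead of the 400 that a full-resolution pass over all four models would need. With a real, noisy `evaluate` the first stage can prune a model it should have kept, which is the price of the savings.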

tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error using Item Response Theory. The Open LLM Leaderboard collapsed from 29,000 examples to 180. Anchor Points showed that as few as 1 to 30 examples could rank-order 87 language-model/prompt pairs on GLUE. Static benchmarks had a weakness you could exploit: model differences often concentrate in a small subset of items, so ranking can survive aggressive subsampling.
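A toy simulation shows why ranking can survive subsampling. It uses plain uniform random sampling, which is cruder than the IRT-based anchor selection in tinyBenchmarks or Anchor Points, and the four models, their accuracies, and the 400-item subset size are assumptions chosen so the gaps between models are wide.

```python
import random


def accuracy(correct, items):
    """Fraction of the given item indices that a model answered correctly."""
    return sum(correct[i] for i in items) / len(items)


def rank(models, items):
    """Model names ordered by accuracy on `items`, best first."""
    return sorted(
        models, key=lambda name: accuracy(models[name], items), reverse=True
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
N_ITEMS = 14_000  # roughly the MMLU size quoted above
SUBSET_SIZE = 400  # assumed; not the 100 anchor items from tinyBenchmarks

# Synthetic models: each item is answered correctly with a fixed probability.
true_acc = {"model_a": 0.3, "model_b": 0.5, "model_c": 0.7, "model_d": 0.9}
models = {
    name: [rng.random() < p for _ in range(N_ITEMS)]
    for name, p in true_acc.items()
}

subset = rng.sample(range(N_ITEMS), SUBSET_SIZE)
full_ranking = rank(models, range(N_ITEMS))
subset_ranking = rank(models, subset)

print("full:  ", full_ranking)
print("subset:", subset_ranking)
print(f"compute reduction: {N_ITEMS / SUBSET_SIZE:.0f}x")
```

The ordering holds here because the models are far apart. When two models differ by a point or two, a random subset of this size will not separate them reliably, which is where careful item selection earns its keep.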

That trick weakened sharply once benchmarks moved from static predictions to agents.

Agent evals are messier

The HAL paper (Kapoor et al., ICLR 2026) is a useful public accounting of agent evaluation costs. It runs standardized agent harnesses across nine benchmarks covering coding, web navigation, science tasks, and customer service, with shared scaffolds and centralized cost tracking. The headline cost: $40,000 for 21,730 rollouts across 9 models and 9 benchmarks. By April 2026, the leaderboard had grown to 26,597 rollouts. Ndzomga’s independent reproduction lands in the same range: $46,000 across 242 agent runs.

Behind that aggregate, the cost of a single benchmark run varies by four orders of magnitude across HAL tasks, and by three orders within some individual benchmarks. Part of the spread is token pricing. Claude Opus 4.1 costs $15 per million input tokens and $75 per million output tokens. Gemini 2.0 Flash costs $0.10 and $0.40, a two-order-of-magnitude spread on input alone. The rest comes from the harness. Agent benchmarks rarely benchmark “the model” in isolation. They benchmark a model × scaffold × token-budget product, and small scaffold choices can multiply costs 10×.
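The model × scaffold × token-budget product is easy to turn into a back-of-the-envelope estimate. The per-token prices below are the ones quoted above; everything else (the `Pricing` and `benchmark_cost` names, the 100-task benchmark, 20 steps per task, and the per-step token counts) is a hypothetical workload chosen for illustration, not a measurement from HAL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float  # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# Prices as quoted in the text.
PRICES = {
    "claude-opus-4.1": Pricing(15.00, 75.00),
    "gemini-2.0-flash": Pricing(0.10, 0.40),
}


def rollout_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one rollout given its total token usage."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


def benchmark_cost(
    pricing: Pricing,
    n_tasks: int,
    steps_per_task: int,
    input_tokens_per_step: int,
    output_tokens_per_step: int,
    repeats: int = 1,
) -> float:
    """USD cost of a full benchmark run; the scaffold sets the step and token counts."""
    per_task = rollout_cost(
        pricing,
        steps_per_task * input_tokens_per_step,
        steps_per_task * output_tokens_per_step,
    )
    return per_task * n_tasks * repeats


# Hypothetical workload: 100 tasks, 20 agent steps each,
# 10,000 input and 500 output tokens per step.
workload = dict(
    n_tasks=100,
    steps_per_task=20,
    input_tokens_per_step=10_000,
    output_tokens_per_step=500,
)
opus = benchmark_cost(PRICES["claude-opus-4.1"], **workload)
flash = benchmark_cost(PRICES["gemini-2.0-flash"], **workload)
opus_x3 = benchmark_cost(PRICES["claude-opus-4.1"], repeats=3, **workload)
print(f"opus ${opus:,.2f}  flash ${flash:,.2f}  ratio {opus / flash:.0f}x")
```

Every factor is multiplicative, so a scaffold that doubles the step count doubles the bill, and running three seeds for error bars triples it.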

Worse, higher spend does not reliably buy better results. On Online Mind2Web, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy. SeeAct with GPT-5 Medium hit 42% for $171. The HAL paper notes “a 9× difference in cost despite just a two-percentage-point difference in accuracy.” On GAIA, a HAL Generalist with o3 Medium cost $2,828 for 28.5% accuracy, while a different agent hit 57.6% for $1,686. CLEAR, evaluating 6 SOTA agents on 300 enterprise tasks, finds that “accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives” with comparable real-world performance.
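One way to read these numbers is as a cost-accuracy Pareto frontier: a run is worth keeping only if no other run is at least as cheap and at least as accurate. The sketch below applies that filter to the four data points quoted above. The `Run` type and `pareto_frontier` function are illustrative names, and the second GAIA agent is labeled generically because the text does not name it.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    cost_usd: float
    accuracy: float  # percent


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Runs not dominated by any other run, sorted by cost.

    A run is dominated if another run costs no more, scores no lower,
    and is strictly better on at least one of the two.
    """
    frontier = []
    for r in runs:
        dominated = any(
            o.cost_usd <= r.cost_usd
            and o.accuracy >= r.accuracy
            and (o.cost_usd < r.cost_usd or o.accuracy > r.accuracy)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_usd)


# Figures as quoted in the text.
mind2web = [
    Run("Browser-Use + Claude Sonnet 4", 1577, 40.0),
    Run("SeeAct + GPT-5 Medium", 171, 42.0),
]
gaia = [
    Run("HAL Generalist + o3 Medium", 2828, 28.5),
    Run("other agent (not named in the text)", 1686, 57.6),
]

for label, runs in [("Online Mind2Web", mind2web), ("GAIA", gaia)]:
    print(label, [r.name for r in pareto_frontier(runs)])

cost_ratio = mind2web[0].cost_usd / mind2web[1].cost_usd
print(f"Mind2Web cost ratio: {cost_ratio:.1f}x")
```

In both benchmarks the more expensive run is dominated outright, so it drops off the frontier. With more runs per benchmark the frontier would usually contain several points, each a different cost-accuracy trade-off.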

What this means for the field

The cost of evaluation is now a first-order constraint on who can participate in AI research. If you’re not a well-funded lab or a company with API credits to burn, you’re effectively locked out of agent evaluation. The compression techniques that worked for static benchmarks (subsampling, anchor points, Item Response Theory) don’t translate well to agent benchmarks, because agent behavior is noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and making any of these evals reliable means repeated runs, which multiply the cost again.

I’ve been watching this trend for a while, and it’s frustrating. The field is moving toward more complex, interactive evaluations, which is good for measuring real-world capability. But the cost structure is creating a two-tier system: those who can afford to run agent evals and those who can’t. The HAL paper’s $40K price tag is just the beginning. As agent benchmarks scale to more tasks, more models, and more configurations, the cost will only grow.

There are some bright spots. Flash-HELM’s coarse-to-fine approach could be adapted for agent benchmarks, though the noise makes it harder. Exgentic’s $22,000 sweep found a 33× cost spread on identical tasks, isolating scaffold choice as a first-order cost driver, which suggests that smarter scaffold selection could reduce costs. UK-AISI’s work on scaling agentic steps into the millions to study inference-time compute is interesting, but it’s not a cost-reduction strategy.

For now, if you’re building an agent benchmark, think carefully about the cost structure. If you’re evaluating an agent, consider whether you really need to run every configuration. And if you’re a researcher without a big budget, be prepared to work with smaller, cheaper benchmarks or to rely on shared infrastructure. The eval cost problem isn’t going away, but it’s worth understanding so you can work around it.
