How many raters do you actually need for a good AI benchmark?

If you’ve spent any time around ML evaluation, you know the drill: grab a handful of raters, have them label a bunch of items, take the majority vote, call it ground truth. It’s fast, it’s cheap, and it’s probably wrong more often than we’d like to admit.

Google Research just published a paper called “Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation” that finally asks the question nobody wants to hear: are we spending our annotation budget wrong?

The forest vs. tree problem

Here’s the core tension. You have a fixed budget for human evaluation. Do you rate lots of items with few raters each (the forest), or fewer items with many raters each (the tree)?
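
To make the trade-off concrete, here is a minimal sketch of the arithmetic. It assumes the budget is simply the total number of individual ratings you can pay for (N times K), which is my simplification rather than something the paper specifies; the function name `budget_splits` is made up for illustration.

```python
def budget_splits(total_ratings, rater_counts):
    """Return (N, K) pairs that fit a fixed budget of individual ratings.

    Assumes cost = N * K, i.e. every rating costs the same (an assumption).
    """
    return [
        (total_ratings // k, k)  # N items you can afford at K raters each
        for k in rater_counts
        if 0 < k <= total_ratings
    ]


# Same budget, very different shapes: wide "forest" vs. deep "tree".
for n_items, k_raters in budget_splits(1000, [1, 5, 10, 50]):
    print(f"N={n_items} items, K={k_raters} raters per item")
```

Every row spends the same 1,000 ratings; the only question is whether they go wide or deep.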

Most of the field has defaulted to the forest approach. Grab 1 to 5 raters per item, cover as many examples as possible, and hope the noise averages out. Google’s simulation work suggests this is often a bad bet.

The intuition is simple: when tasks involve subjective judgment – toxicity, hate speech, even something like image quality – humans disagree. A lot. Collapsing that disagreement into a single plurality label throws away information and makes your benchmark less reproducible. Two different research groups can run the same evaluation and get different results simply because of who happened to be in the rater pool.
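
A quick illustration of what gets thrown away. This is a hypothetical example with made-up ratings, not data from the paper: two items end up with the same plurality label even though raters were unanimous on one and nearly split on the other.

```python
from collections import Counter


def plurality_label(ratings):
    """Collapse a list of ratings to the single most common label."""
    return Counter(ratings).most_common(1)[0][0]


def label_distribution(ratings):
    """Keep the disagreement: fraction of raters who chose each label."""
    counts = Counter(ratings)
    return {label: count / len(ratings) for label, count in counts.items()}


# Made-up ratings for two items (illustrative only).
unanimous = ["toxic"] * 5
contested = ["toxic"] * 3 + ["ok"] * 2

for ratings in (unanimous, contested):
    print(plurality_label(ratings), label_distribution(ratings))
```

Both items come out as "toxic" under plurality voting, but only the distribution shows that the second one was a 3-to-2 call that a different rater pool could easily flip.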

What the simulation actually found

The team built a simulator using real-world datasets for subjective tasks and stress-tested thousands of combinations. They varied N (total items rated) from 100 to 50,000 and K (raters per item) from 1 to 500. The goal was to find configurations that produced statistically reliable results (p < 0.05), meaning another team repeating the experiment would likely reach the same conclusion.
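
To give a feel for this kind of sweep, here is a toy Monte Carlo sketch. It is not the paper's simulator: the noise model (each rater independently reports the "true" label with probability `agree_prob`), the two hypothetical models A and B, and every name and parameter below are my own assumptions. It also measures something simpler than a p-value: how often a truly better model A actually scores higher than B against plurality labels when the whole experiment is repeated.

```python
import random


def simulate_win_rate(n_items, k_raters, agree_prob, acc_a, acc_b,
                      trials=200, seed=0):
    """Fraction of simulated replications where model A outscores model B.

    Toy assumptions: binary labels, raters are independent and each reports
    the true label with probability agree_prob, and models A and B match the
    true label with probability acc_a and acc_b respectively.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        score_a = score_b = 0
        for _ in range(n_items):
            # Count raters who report the true label for this item.
            votes = sum(rng.random() < agree_prob for _ in range(k_raters))
            if 2 * votes > k_raters:
                label_ok = True            # plurality label is the true one
            elif 2 * votes < k_raters:
                label_ok = False           # plurality label is the wrong one
            else:
                label_ok = rng.random() < 0.5  # tie broken at random
            a_ok = rng.random() < acc_a
            b_ok = rng.random() < acc_b
            # With binary labels, a prediction matches the plurality label
            # exactly when both are right or both are wrong.
            score_a += a_ok == label_ok
            score_b += b_ok == label_ok
        wins += score_a > score_b
    return wins / trials


if __name__ == "__main__":
    budget = 2000  # total ratings, assumed to cost the same each
    for k in (1, 5, 20):
        n = budget // k
        rate = simulate_win_rate(n, k, agree_prob=0.7, acc_a=0.80, acc_b=0.75)
        print(f"N={n:5d} K={k:3d} -> A beats B in {rate:.0%} of replications")
```

Lowering `agree_prob` toward 0.5 is a crude stand-in for a more subjective task. The numbers this prints say nothing about any real benchmark; for that, use the simulator released with the paper.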

Unsurprisingly, the sweet spot depends on how subjective the task is. For highly subjective tasks, you need more raters per item. The paper provides an open-source simulator so you can run the numbers for your own use case, which I appreciate. Too many research papers hand-wave this stuff.

What caught my attention was how far off the default settings are. Many common benchmarks that use 1 to 3 raters per item are essentially gambling on reproducibility. You might get lucky, but you’re not building reliable science.

The practical takeaway

If you’re building a benchmark or running an evaluation, don’t just default to the cheapest annotation setup. Run the simulator (they released it, link in the paper) and figure out where your budget actually needs to go. Sometimes that means fewer items with more raters each.

This isn’t a new problem – the NLP community has been grumbling about annotation disagreements for years. But having a concrete framework to optimize the N/K trade-off is genuinely useful. I just wish more of these insights made it into the standard evaluation pipelines people actually use.

The paper is worth a read if you care about evaluation methodology. And if you don’t care about evaluation methodology, you should start caring, because bad benchmarks mean bad science.
