Google’s AI Overviews, the Gemini-powered summary robot that sits at the top of search results, has been a punching bag since its 2024 launch. People have complained about it serving up nonsense, hallucinating facts, and generally making a mess of what used to be a simple link list. It has gotten better, I’ll give it that. But “better” doesn’t mean “good.”
The New York Times teamed up with a startup called Oumi to actually measure how often AI Overviews gets things wrong. Oumi used the SimpleQA benchmark, which OpenAI released back in 2024. It’s just a list of over 4,000 questions with verifiable answers. You feed them into an AI and check whether the answers match reality. Simple stuff.
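For a sense of what "feed them in and check" means in practice, here is a minimal sketch of that kind of scoring loop. Everything in it is an assumption for illustration: the questions are toy examples rather than SimpleQA items, `toy_model` is a hypothetical stand-in for the system under test, and the naive string match is almost certainly cruder than whatever grading Oumi actually used.

```python
from typing import Callable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count as errors."""
    return " ".join(text.lower().split())


def accuracy(qa_pairs: List[Tuple[str, str]], answer_fn: Callable[[str], str]) -> float:
    """Return the fraction of questions whose answer matches the reference after normalization."""
    if not qa_pairs:
        raise ValueError("need at least one question")
    correct = sum(
        1 for question, reference in qa_pairs
        if normalize(answer_fn(question)) == normalize(reference)
    )
    return correct / len(qa_pairs)


# Toy data for illustration only; these are NOT taken from SimpleQA.
TOY_QUESTIONS = [
    ("What is the chemical symbol for gold?", "Au"),
    ("How many days are in a leap year?", "366"),
    ("What is the capital of France?", "Paris"),
    ("How many sides does a hexagon have?", "6"),
]


def toy_model(question: str) -> str:
    """Hypothetical stand-in for the AI being tested; it gets one answer wrong on purpose."""
    canned = {
        "What is the chemical symbol for gold?": "au",
        "How many days are in a leap year?": "366",
        "What is the capital of France?": "Paris",
        "How many sides does a hexagon have?": "8",  # deliberately wrong
    }
    return canned.get(question, "")


score = accuracy(TOY_QUESTIONS, toy_model)
print(f"accuracy: {score:.0%}")
```

Swap in a real question set and a real model call and the headline number is just this ratio, computed over a few thousand questions instead of four.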
When Oumi first ran the test last year, on the Gemini 2.5 model, AI Overviews was accurate 85 percent of the time. After the Gemini 3 update, that number climbed to 91 percent. So it’s right roughly 9 out of 10 times.
Sounds okay, right? But flip that around. Roughly one in ten answers is wrong. That’s a 9 percent error rate on a product that handles billions of queries daily. Extrapolate that out, and you’re looking at tens of millions of incorrect answers per day. Hundreds of thousands per hour. Every minute, thousands of people are getting confidently wrong information from the biggest search engine on the planet.
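If you want to run that back-of-the-envelope math yourself, here is a small sketch. The 91 percent accuracy figure comes from Oumi's test; the daily volume is purely an assumed placeholder, since the share of searches that actually show an AI Overview isn't given here, and `wrong_answers` is a hypothetical helper name.

```python
def wrong_answers(daily_answers: int, accuracy_pct: int) -> dict:
    """Estimate wrong answers per day, hour, and minute using integer arithmetic."""
    if not 0 <= accuracy_pct <= 100:
        raise ValueError("accuracy_pct must be between 0 and 100")
    per_day = daily_answers * (100 - accuracy_pct) // 100
    per_hour = per_day // 24
    per_minute = per_hour // 60
    return {"per_day": per_day, "per_hour": per_hour, "per_minute": per_minute}


# ASSUMPTION: 120 million AI Overview answers per day is an illustrative
# placeholder, not a figure reported by Google, Oumi, or the Times.
ASSUMED_DAILY_ANSWERS = 120_000_000

estimate = wrong_answers(ASSUMED_DAILY_ANSWERS, accuracy_pct=91)
for period, count in estimate.items():
    print(f"{period}: {count:,}")
```

Plug in a bigger daily volume and the totals scale linearly, which is the whole point: at this scale, even small error percentages turn into enormous absolute numbers.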
I’ve been watching this space for years, and what surprised me isn’t the accuracy number but the sheer volume of garbage being pushed out. Google has been treating accuracy as a percentage game, which is the wrong way to think about it when your scale is this massive. A 99 percent accuracy rate would still mean millions of errors daily. At 91 percent? That’s a disaster.
The problem isn’t just the errors themselves. It’s the presentation. AI Overviews sits at the top of the page, formatted like an authoritative answer. A lot of people don’t scroll past it. They take that summary as gospel. So when it’s wrong, it’s not just a wrong search result. It’s a wrong answer delivered with the full weight of Google’s branding behind it.
Oumi’s test also only covers questions with clear, factual answers. The kind of stuff you can look up in an encyclopedia. Real-world queries are messier. Opinions, recommendations, subjective advice. I’d bet the error rate on those is even higher.
Google has been playing catch-up on AI accuracy for years now. Every update brings incremental improvements, but the fundamental issue remains: large language models are not databases. They generate text that looks correct. They don’t know things. They guess, based on patterns. And when they guess wrong, there’s no shame, no hesitation. Just a confident lie.
I’m not saying AI Overviews is useless. For simple, well-documented facts, it’s probably fine most of the time. But “most of the time” is not a standard we should accept for a tool that hundreds of millions of people rely on daily. Google needs to stop benchmarking against itself and start thinking about what happens when nearly a tenth of your answers are poison.