Groundsource: Google’s Gemini turns news articles into a flood database

Google Research just dropped something I’ve been waiting for: a way to turn the noise of global news into actual, usable data. They call it Groundsource, and the first output is a massive open-access dataset of urban flash floods.

Let’s cut through the corporate speak. The problem is simple: we don’t have good historical data on flash floods. Earthquakes? We’ve got seismometers everywhere. Floods? Not so much. Satellites are great, but clouds block them, they only pass over every few days, and they mostly catch the big, slow disasters. The Dartmouth Flood Observatory has been doing good work for years, but its record is still relatively sparse. GDACS, the UN-backed disaster alert system? About 10,000 entries. That’s not nothing, but it’s a drop in the bucket when you’re trying to train AI models that need to work globally.

Groundsource’s approach is clever: instead of trying to build a better sensor, they mine what already exists — news reports. Think about it. Every time a flash flood hits a city, there’s a local newspaper article, a government bulletin, maybe a social media post. The problem has always been scale. Nobody can read millions of articles and manually extract the location, date, and severity of each flood.

That’s where Gemini comes in. The Groundsource pipeline feeds news articles into Gemini, which extracts structured information: where did the flood happen, when, how severe, how many people were affected. The team then applies some validation steps to filter out noise. The result? 2.6 million flood events across 150 countries, from 2000 to the present. That’s not a typo — 2.6 million records, compared to GDACS’s 10,000.
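To make the "extract, then validate" idea concrete, here is a minimal sketch of what a validation step could look like. To be clear, this is my guess at the shape of the thing, not Google’s actual pipeline: the `FloodEvent` fields, the severity labels, and the `parse_extraction` helper are all hypothetical, and the model call itself is left out. The sketch only checks a JSON string that a model might return.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical severity labels; the real dataset may use a different scheme.
ALLOWED_SEVERITIES = {"minor", "moderate", "severe"}


@dataclass(frozen=True)
class FloodEvent:
    """Assumed record shape, for illustration only."""
    location: str
    event_date: date
    severity: str


def parse_extraction(raw_json: str, published: date) -> Optional[FloodEvent]:
    """Validate one model response; return None if any check fails."""
    try:
        data = json.loads(raw_json)
        event_date = date.fromisoformat(data["event_date"])
        location = data["location"].strip()
        severity = data["severity"]
    except (ValueError, KeyError, AttributeError, TypeError):
        # Malformed JSON, missing fields, or wrong types: treat as noise.
        return None
    if not location:
        return None
    if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
        return None
    # An article cannot report a flood that happened after it was published.
    if event_date > published:
        return None
    return FloodEvent(location, event_date, severity)


good = '{"location": "Dhaka", "event_date": "2021-06-01", "severity": "severe"}'
print(parse_extraction(good, published=date(2021, 6, 2)))
print(parse_extraction("not json at all", published=date(2021, 6, 2)))
```

The first call returns a populated `FloodEvent`; the second returns `None`. The design choice worth noting is that every failure mode collapses to "drop the record", which is the conservative option when you are processing millions of articles and cannot inspect each one.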

The chart they published shows the explosion of digitized news over the last 25 years, and Groundsource’s event count tracks that curve pretty closely. It makes sense — more news means more raw material. But the interesting part is the density in recent years (2020–2025). We’re not just getting more articles; we’re getting better coverage of smaller, localized events that would have been missed entirely a decade ago.

Is this dataset perfect? No. News reports have their own biases. Rich countries with more media coverage will be overrepresented. A flash flood in a remote village in Bangladesh might get one short wire service report, while a similar event in a US suburb gets wall-to-wall coverage. The team acknowledges this, and the dataset comes with confidence scores so you can filter by reliability.
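If you do filter on those scores, it is worth checking what the filter does to regional coverage. Here is a small sketch under loud assumptions: the field names `country` and `confidence` are hypothetical (I have not verified the published schema), and the rows are toy values invented for illustration, not real records.

```python
from collections import Counter

# Toy rows, invented for illustration; not taken from the real dataset.
events = [
    {"country": "US", "confidence": 0.95},
    {"country": "US", "confidence": 0.80},
    {"country": "BD", "confidence": 0.55},
    {"country": "BD", "confidence": 0.90},
    {"country": "NG", "confidence": 0.40},
]


def filter_by_confidence(rows, threshold):
    """Keep only rows whose confidence is at or above the threshold."""
    return [r for r in rows if r["confidence"] >= threshold]


def events_per_country(rows):
    """Count events per country so coverage can be compared across filters."""
    return Counter(r["country"] for r in rows)


reliable = filter_by_confidence(events, threshold=0.75)
before = events_per_country(events)
after = events_per_country(reliable)

# Print the per-country counts before and after filtering.
for country in sorted(before):
    print(country, before[country], after[country])
```

In this toy example the threshold removes every "NG" row. Whether the real data behaves that way is something I can’t tell you, but comparing per-country counts before and after a cut is a cheap way to find out whether your reliability filter is quietly making the coverage bias worse.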

But here’s why I think this matters: it’s not just about floods. The Groundsource methodology is general. You could apply it to wildfires, landslides, disease outbreaks — any event that gets reported in the news. If Google follows through and releases more datasets, this could fundamentally change how we model natural disasters.

For now, the flash flood dataset is available openly. If you do any work in hydrology, urban planning, or climate risk, go grab it. Having 2.6 million data points to train on is going to change what’s possible with flood forecasting. And for the rest of us, it’s a reminder that the data we need is often already out there — we just need the right tools to extract it.
