Can LLMs Actually Help Physicists? Google Put 6 Models to the Test on Superconductivity

I’ve been watching the “AI for science” space with a mix of excitement and skepticism for a while now. It’s one thing to have a chatbot summarize an email or draft a grocery list. It’s another thing entirely to ask it to weigh in on a decades-old open problem in condensed matter physics. So when I saw that Google Research published a paper in PNAS testing LLMs on high-temperature superconductivity, I had to dig in.

The setup is smart: take six LLMs, throw expert-level questions at them about cuprate superconductors (the copper-oxide compounds that conduct electricity with zero resistance at temperatures as high as about -140°C), and have a panel of physicists grade the answers. The grading covered not just factual accuracy, but also how well the models handle competing theories, unresolved debates, and the kind of nuanced thinking a grad student or experienced researcher would need.
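To make the grading idea concrete, here is a minimal sketch of how panel scores could be rolled up into a ranking. Everything in it is an assumption for illustration: the model names, the two criteria, the 0-2 scale, and the unweighted averaging are hypothetical and are not taken from the paper’s actual rubric.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical grades: (model, criterion, grader, score on a 0-2 scale).
# Names, criteria, and scale are illustrative assumptions, not the paper's rubric.
grades = [
    ("model_a", "accuracy", "grader_1", 2),
    ("model_a", "accuracy", "grader_2", 1),
    ("model_a", "balance", "grader_1", 2),
    ("model_a", "balance", "grader_2", 2),
    ("model_b", "accuracy", "grader_1", 2),
    ("model_b", "accuracy", "grader_2", 2),
    ("model_b", "balance", "grader_1", 0),
    ("model_b", "balance", "grader_2", 1),
]


def mean_scores(rows):
    """Average the graders' scores for each (model, criterion) pair."""
    buckets = defaultdict(list)
    for model, criterion, _grader, score in rows:
        buckets[(model, criterion)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


def rank_models(rows):
    """Rank models by the unweighted mean of their per-criterion averages."""
    by_model = defaultdict(list)
    for (model, _criterion), avg in mean_scores(rows).items():
        by_model[model].append(avg)
    totals = {model: mean(avgs) for model, avgs in by_model.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for model, score in rank_models(grades):
        print(f"{model}: {score:.2f}")
```

In this toy data, the hypothetical model_b is more accurate on average but scores poorly on balance, so it ranks below model_a. That is the kind of trade-off a multi-criterion rubric is meant to surface.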

The top performers? NotebookLM and a custom-built system, both of which draw from a closed ecosystem of curated, quality-controlled sources. That’s not surprising: when you’re dealing with a field where the literature is thousands of papers deep and full of conflicting interpretations, having a trusted reference set matters. The open-web models struggled more, often giving answers that sounded plausible but missed key subtleties or failed to acknowledge where the field disagrees.

Here’s what I found most interesting: the study didn’t just test whether the models could regurgitate facts. The researchers asked questions that required balancing evidence across experimental techniques, such as comparing results from angle-resolved photoemission spectroscopy (ARPES) with those from neutron scattering. That’s the kind of thing a human expert does by intuition and deep reading, and it’s genuinely hard for an LLM to get right without a solid grounding mechanism.

One example from the paper: a question about the role of charge density waves in cuprate superconductivity. This is an active debate—some groups think these waves are central to the pairing mechanism, others think they’re a competing phase. The best models gave a balanced view, citing both sides and pointing to specific experiments. The worst ones picked a side without acknowledging the controversy, which would mislead a student who doesn’t know better.

Now, the caveats. This is a single field, and a notoriously tricky one. High-Tc superconductivity has been a puzzle since its discovery in 1986, and even the experts don’t agree on the mechanism. So if an LLM can’t nail this, it doesn’t mean it’s useless for other scientific domains. But the methodology, which relies on expert evaluation of open questions rather than benchmark datasets alone, is exactly what we need more of.

Also worth noting: the closed systems aren’t a panacea. They’re limited by their source material. If a key paper from 2023 isn’t in the curated set, the model won’t know about it. And the custom system required significant engineering to set up. This isn’t something a lab can just plug in overnight.
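The coverage limit is easy to see in a toy version of a closed-corpus system. The sketch below is not how NotebookLM or the custom system works; it is a deliberately naive word-overlap lookup, and the document ids, texts, and function names are all hypothetical placeholders.

```python
# Hypothetical curated corpus: ids and texts are placeholders, not real papers.
CURATED = {
    "doc_arpes": "ARPES measurements of the pseudogap in cuprates",
    "doc_neutron": "neutron scattering study of spin fluctuations in cuprates",
}


def retrieve(query, corpus, min_overlap=1):
    """Return ids of documents sharing at least min_overlap words with the query."""
    query_words = set(query.lower().split())
    hits = []
    for doc_id, text in corpus.items():
        overlap = len(query_words & set(text.lower().split()))
        if overlap >= min_overlap:
            hits.append((overlap, doc_id))
    # Best match (largest word overlap) first.
    return [doc_id for _overlap, doc_id in sorted(hits, reverse=True)]


def answer(query, corpus):
    """Refuse to answer when nothing in the curated set supports the query."""
    sources = retrieve(query, corpus)
    if not sources:
        return "No supporting source in the curated set."
    return f"Answer grounded in: {', '.join(sources)}"


if __name__ == "__main__":
    print(answer("neutron scattering", CURATED))
    print(answer("charge density waves", CURATED))
```

The second query comes back empty because nothing in this placeholder corpus mentions the topic. A grounded system can only be as current and complete as the set it draws from.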

Still, I’m cautiously optimistic. The idea of an LLM as a “thought partner” for scientists—someone who can catch you up on decades of literature, point out where the field disagrees, and help you formulate new hypotheses—is compelling. We’re not there yet, but this study shows a path. The next step is to scale the curated approach to more fields and make the systems easier to deploy.

If you’re a physicist or just someone who cares about how AI actually performs under pressure, this paper is worth a read. It’s honest about the limitations and clear about what worked. No hype, just results.
