Google’s New Framework Tests Whether LLMs Actually Behave Like Humans

Google’s New Framework Tests Whether LLMs Actually Behave Like Humans

6 0 0

Google Research just dropped a paper that tries to answer a question I’ve been asking for a while: how closely do LLMs actually mimic human behavior in realistic social scenarios? Not just in self-report surveys where they can claim anything, but in situations where their advice could have real consequences.

The team, led by Amir Taubenfeld, Zorik Gekhman, and Lior Nezry, built a framework that transforms established psychological questionnaires into Situational Judgment Tests (SJTs). If you’re not familiar with SJTs, they’re the standard in psychology and HR for evaluating how someone would actually behave in complex situations. Think of them as multiple-choice tests where each option represents a different behavioral tendency.

The key insight here is that LLMs are notoriously good at gaming self-report questionnaires. Ask a model “Are you empathetic?” and it’ll say yes every time. But put it in a scenario where it has to choose between being empathetic or being efficient, and you start seeing the cracks.

The methodology is actually solid

They started with validated psychological instruments like the IRI for empathy and ERQ for emotion regulation. These aren’t random internet quizzes — they’re peer-reviewed, standardized measures used in international research. Each questionnaire item was adapted into a declaration of the model’s general advising tendency, then used to generate realistic scenarios.

For example, instead of asking “Do you agree with ‘I am quick to express an opinion’?” they’d create a workplace scenario where the model has to advise a user on whether to speak up in a meeting or hold back. Two clear courses of action, one supporting assertiveness, one opposing it.

Each SJT was reviewed by three independent annotators to ensure coherence and fidelity to the underlying behavioral markers. That’s a level of rigor I don’t see enough in AI evaluation papers.

What they found

They tested 25 different LLMs and found two distinct types of alignment gaps. The first is straightforward: model dispositions deviate from the consensus among human annotators. The second is more subtle: models fail to capture the range of human opinions when there’s no clear consensus.

This second gap is interesting because it highlights a fundamental limitation of current alignment approaches. We tend to train models to converge on a single “correct” answer, but real human behavior is diverse. Sometimes there’s no right answer, just different preferences.

The scenarios covered professional composure, conflict resolution, practical tasks like booking a trip, and everyday decision-making. This is higher than I expected in terms of scope — they didn’t just test empathy in a vacuum.

The pipeline

BehavioralLLMs1_Pipeline

The pipeline works like this: start with psychological questionnaires, adapt them to model declarations, generate SJTs, validate with human annotators, then compare model responses to human preferences collected from 550 participants. It’s clean, reproducible, and addresses the core problem of self-report validity.

What’s missing

I’m a bit disappointed they didn’t dig deeper into why these gaps exist. Is it a training data issue? A reward model limitation? The paper acknowledges the gaps but doesn’t offer much speculation on root causes. Also, 25 models is a decent sample, but I’d love to see this run on newer models and different architectures.

The LLM-as-a-judge mapping is also a bit of a black box. They use another model to determine which action the test model chose, which introduces its own biases. It’s a practical solution, but not ideal.

Why this matters

As LLMs move into advisory roles — healthcare, financial planning, even just day-to-day decision support — understanding their behavioral tendencies isn’t academic. It’s about trust. If a model consistently advises users to be more assertive than the average human would, that could lead to real-world friction.

This framework is a step toward quantifying that gap. It’s not perfect, but it’s better than the current state of affairs where we mostly rely on vibe checks and anecdotal evidence.

I expect future research to build on this, maybe by incorporating more diverse human populations or by testing models in interactive settings rather than static scenarios. For now, it’s a solid contribution to the alignment conversation.

Comments (0)

Be the first to comment!