Google and NYU Built a Sim to Grade 'Future-Ready' Skills. Here's How It Works.

I’ve been watching the “future-ready skills” conversation for a while now. Every few months, some think tank or consulting firm puts out a report saying critical thinking, collaboration, and creativity will save us from the robots. The problem? Nobody has a good way to measure whether students actually have these skills. Multiple-choice tests sure as hell don’t cut it.

Google Research, in partnership with New York University, has been working on that problem. They just released Vantage, a research experiment that uses generative AI to create simulated team conversations for assessing these soft skills. And the early results are actually interesting: the AI’s scoring is on par with human experts.

The measurement problem

Here’s the thing about future-ready skills: they’re a pain to assess. You can’t write a multiple-choice question for “does this person handle conflict well” or “can they build on someone else’s ideas.” Real human interactions would be ideal, but they’re expensive, hard to standardize, and a nightmare to grade consistently. How do you fairly evaluate conflict resolution if your group never disagrees? What if a team settles on the first idea that comes up and never has to build creatively?

Standardized tests are too rigid. They strip away the context and interaction that make these skills meaningful. So most schools just don’t assess them at all, which means they don’t get taught. It’s the classic “what gets measured gets managed” problem, except nobody’s measuring.

How Vantage works

The Vantage setup is clever. Students enter a simulated conversation with multiple AI avatars, all working together on a task. Think preparing for a debate or pitching a creative idea. The avatars aren’t just passive chatbots — there’s an “Executive LLM” running the show, using a rubric to steer the conversation toward effective assessment.

Here’s the part I like: the system dynamically introduces challenges. It’ll have an avatar push back on an idea, or introduce a conflict. It’s watching how the student responds. Do they get defensive? Do they find a way to incorporate the feedback? Can they keep the group moving forward? The Executive LLM keeps adjusting the scenario until it has enough information to assess the student’s skills.
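To make that loop concrete, here is a minimal sketch of how an adaptive "keep probing until you have enough evidence" controller could look. This is not Vantage's actual implementation, which I haven't seen. The skill names, the `CHALLENGES` table, the `EvidenceTracker` class, and the `observe` callback are all hypothetical stand-ins, and a real system would have an LLM judging the student's replies instead of a simple callback.

```python
from dataclasses import dataclass, field

# Hypothetical rubric skills, each paired with a challenge an avatar could raise.
CHALLENGES = {
    "conflict_resolution": "An avatar disagrees sharply with the student's plan.",
    "building_on_ideas": "An avatar offers a half-formed idea and asks for help.",
    "keeping_momentum": "An avatar drags the group off topic.",
}


@dataclass
class EvidenceTracker:
    """Counts how many usable observations exist for each rubric skill."""
    needed: int = 2
    counts: dict = field(default_factory=lambda: {s: 0 for s in CHALLENGES})

    def weakest_skill(self) -> str:
        # Skill with the least evidence so far (ties go to the first one listed).
        return min(self.counts, key=self.counts.get)

    def record(self, skill: str) -> None:
        self.counts[skill] += 1

    def done(self) -> bool:
        return all(c >= self.needed for c in self.counts.values())


def run_session(observe, max_turns=10, needed=2):
    """Issue challenges until every skill has enough evidence or turns run out.

    `observe(skill)` is an assumed callback that returns True when the
    student's reply gave usable evidence for that skill.
    """
    tracker = EvidenceTracker(needed=needed)
    transcript = []
    for _ in range(max_turns):
        if tracker.done():
            break
        skill = tracker.weakest_skill()
        transcript.append((skill, CHALLENGES[skill]))
        if observe(skill):
            tracker.record(skill)
    return tracker.counts, transcript


if __name__ == "__main__":
    counts, transcript = run_session(lambda skill: True)
    for skill, challenge in transcript:
        print(f"{skill}: {challenge}")
    print(counts)
```

The design choice worth noticing is the stopping rule: the session ends when the evidence is sufficient, not after a fixed number of questions, with `max_turns` as a safety cap so a quiet student doesn't get stuck in an endless simulation.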

This is basically an adaptive test, but for things that actually matter in the real world. The AI acts as both the assessment engine and the simulated teammates. It’s a sandbox where students can practice and be evaluated on the same skills that frameworks like the OECD Learning Compass 2030 and the WEF’s Future of Jobs report keep telling us are critical.

Does it actually work?

The study with NYU found that the AI’s scoring was on par with human experts. That’s not nothing. Human raters disagree with each other plenty on subjective assessments, so matching human performance is a meaningful benchmark. The system is consistent, scalable, and doesn’t get tired after grading the 50th student.
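If you want a feel for what "on par with human experts" can mean in numbers, one common way to compare two raters is Cohen's kappa, which corrects raw agreement for the agreement you'd expect by chance. I don't know which statistic the Vantage study actually used, so treat this as an illustration of the general idea and not a reconstruction of their analysis. The function below is a plain-Python sketch.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    n = len(rater_a)

    # Fraction of items where the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected if each rater assigned scores independently,
    # at their own observed rate for each score.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical score for everything.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa of 1 means perfect agreement and 0 means no better than chance. The useful comparison is AI-versus-human kappa against human-versus-human kappa on the same transcripts: if the two are close, the AI disagrees with experts about as often as experts disagree with each other.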

That said, I have questions. The research paper mentions that the Executive LLM uses a “provided assessment rubric.” Who writes that rubric? How do you ensure it’s not encoding the biases of whoever designed it? And there’s a difference between matching human experts in a controlled study and actually deploying this in a classroom where students might game the system or feel anxious about being evaluated by AI.

Vantage is currently open for sign-up in English on Google Labs. It’s aimed at high school and college students. The team is clear that this is a research experiment, not a finished product. But it’s the most practical approach I’ve seen for tackling this measurement problem.

Why this matters more than you think

We’re heading into a world where AI can do a lot of the routine cognitive work. The skills that remain valuable — critical thinking, collaboration, creativity, conflict resolution — are exactly the ones we’re worst at teaching and assessing. If we can’t measure them, we can’t improve them. And if we can’t improve them, we’re sending students into a job market that increasingly demands these competencies.

I’ve seen a lot of AI-in-education hype that amounts to “let’s put chatbots in classrooms.” This is different. It’s using AI to solve a specific, hard problem that has resisted traditional approaches. The simulated environment approach is more authentic than a test but more scalable than real human assessment. It’s not perfect, but it’s a real attempt to bridge the gap between what we say we value and what we actually measure.

The next step will be seeing how this holds up in real classrooms, with real students who might not be motivated to perform well in a simulated environment. But for now, this is one of the more thoughtful applications of generative AI in education I’ve come across.
