VAKRA Benchmark: Why AI Agents Still Trip Over Simple Enterprise Tasks

IBM Research just dropped VAKRA, a new benchmark that actually makes AI agents work for a living. No more toy problems or isolated skill tests. It drops agents into an executable environment with over 8,000 locally hosted APIs across 62 domains, backed by real databases and document collections. Tasks require 3- to 7-step reasoning chains that combine structured API calls with unstructured retrieval, all under natural-language constraints.

The results are not pretty. Models perform poorly across the board. And honestly, that’s exactly what we need right now – a reality check on how far these systems are from reliable enterprise deployment.

What VAKRA Actually Measures

VAKRA breaks down into four capabilities, each designed to stress-test a different aspect of agentic reasoning. The first one, API Chaining using Business Intelligence APIs, has over 2,000 test instances across 54 domains. Agents need to chain 1 to 12 tool calls to answer questions like “Which football team has a build-up play speed of 31, build-up play dribbling of 53, and build-up play passing of 32?” The correct answer is FC Barcelona, but getting there requires calling get_data first, then filtering step by step through a JSON data source.

What’s clever here is the design. The get_data tool returns a lightweight preview of the data, just the schema and the first few values, while keeping the full dataset server-side. That avoids pushing massive payloads over MCP. The same call also configures the server to expose the right tool set for that specific domain. The SLOT-BIRD collection provides 7 generic data manipulation tools inspired by Tableau and Google Analytics. The SEL-BIRD collection extends that with more specialized functions, for example splitting sort_data into sort_data_ascending and sort_data_descending. Every key in the data gets its own get function, averaging about 4 per instance.
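To make the preview idea concrete, here is a minimal sketch of the pattern, not VAKRA's actual server code. The store, the handle names, the signatures of get_data and filter_data, and the short key names (speed, dribbling, passing) are all assumptions for illustration. Only the FC Barcelona values come from the example question above; the other rows are made up.

```python
from typing import Any

# Hypothetical server-side store: handle -> full list of rows.
_STORE: dict[str, list[dict[str, Any]]] = {}


def _preview(handle: str, max_values: int = 2) -> dict[str, Any]:
    """Summarize a stored dataset: row count, key types, first few values."""
    rows = _STORE[handle]
    keys = list(rows[0]) if rows else []
    return {
        "handle": handle,
        "num_rows": len(rows),
        "schema": {k: type(rows[0][k]).__name__ for k in keys},
        "sample": {k: [r[k] for r in rows[:max_values]] for k in keys},
    }


def get_data(handle: str, rows: list[dict[str, Any]]) -> dict[str, Any]:
    """Load rows server-side and hand the agent only a preview."""
    _STORE[handle] = rows
    return _preview(handle)


def filter_data(handle: str, key: str, value: Any, out: str) -> dict[str, Any]:
    """Filter a stored dataset by equality; store the result under `out`."""
    _STORE[out] = [r for r in _STORE[handle] if r.get(key) == value]
    return _preview(out)


def get_team_name(handle: str) -> list[str]:
    """Per-key getter, in the spirit of one get function per key."""
    return [r["team_name"] for r in _STORE[handle]]


# Made-up rows, except for the FC Barcelona values quoted in the article.
TEAMS = [
    {"team_name": "FC Barcelona", "speed": 31, "dribbling": 53, "passing": 32},
    {"team_name": "Example United", "speed": 31, "dribbling": 40, "passing": 32},
    {"team_name": "Sample City", "speed": 60, "dribbling": 53, "passing": 45},
]

get_data("teams", TEAMS)
filter_data("teams", "speed", 31, out="s1")
filter_data("s1", "dribbling", 53, out="s2")
final = filter_data("s2", "passing", 32, out="s3")
answer = get_team_name("s3")
```

Each call returns a small dictionary regardless of dataset size, so the agent only ever sees handles and previews while the rows stay on the server.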

The second capability, Tool Selection using Dashboard APIs, covers 1,597 instances across 17 domains. This one uses REST API endpoints wrapped by an MCP server, with each domain having 6 to 328 tools (average 116). The catch is that OpenAI’s API specification limits tool lists to 128 tools max, so the largest domains can’t be passed to the model in a single list. That’s a real constraint if you’re building an agent that needs to pick from hundreds of available APIs. The get_data tool again configures the server to expose only relevant domain-specific APIs, but the selection problem remains nontrivial.
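One common workaround for the 128-tool cap is to shortlist tools before handing them to the model. The sketch below ranks tools by plain keyword overlap with the query. The shortlist_tools function and the tool dictionary shape are hypothetical, and a real system would more likely use embeddings or a learned retriever; this is not how VAKRA or its baselines do it.

```python
import re

MAX_TOOLS = 128  # the cap the article attributes to OpenAI's tool lists


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores split tool names into words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(
    query: str, tools: list[dict[str, str]], limit: int = MAX_TOOLS
) -> list[dict[str, str]]:
    """Keep the `limit` tools whose name and description best overlap the query."""
    query_tokens = _tokens(query)

    def score(tool: dict[str, str]) -> int:
        text = tool["name"] + " " + tool.get("description", "")
        return len(query_tokens & _tokens(text))

    # sorted() is stable, so tools with equal scores keep their original order.
    return sorted(tools, key=score, reverse=True)[:limit]
```

Because the endpoint names are query-aligned, even a crude lexical score like this can push the right endpoint near the top, though it will miss paraphrases.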

Where Agents Fall Apart

Looking at the failure modes, a few patterns emerge that anyone who’s tried to deploy agents in production will recognize immediately.

First, agents struggle with the initial data retrieval step. They forget to call get_data, or they call it with the wrong tool_universe_id. This is basic housekeeping, but models treat it as optional. Second, multi-step filtering breaks down. Agents lose track of intermediate results, apply filters in the wrong order, or try to filter before they have data. The chain of tool calls requires maintaining state across steps, and these models just don’t do that reliably.
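One way to catch both problems outside the model is a thin wrapper that tracks the chain and refuses out-of-order calls. This is a sketch under assumptions: ChainState and ChainOrderError are hypothetical names, and the only detail taken from the benchmark description is that get_data takes a tool_universe_id.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


class ChainOrderError(RuntimeError):
    """Raised when a tool call violates the expected ordering."""


@dataclass
class ChainState:
    goal: str
    tool_universe_id: Optional[str] = None
    steps: list[dict[str, Any]] = field(default_factory=list)

    def check(self, tool: str, args: dict[str, Any]) -> None:
        """Reject get_data without an id, and any other tool before get_data."""
        if tool == "get_data":
            if not args.get("tool_universe_id"):
                raise ChainOrderError("get_data needs a tool_universe_id")
        elif self.tool_universe_id is None:
            raise ChainOrderError(f"{tool} called before get_data")

    def record(self, tool: str, args: dict[str, Any], result: Any) -> None:
        """Validate the call, then append it to the step history."""
        self.check(tool, args)
        if tool == "get_data":
            self.tool_universe_id = args["tool_universe_id"]
        self.steps.append({"tool": tool, "args": args, "result": result})

    def summary(self) -> str:
        """Render the goal and every step so far, e.g. to re-inject into a prompt."""
        lines = [f"Goal: {self.goal}"]
        for i, step in enumerate(self.steps, 1):
            lines.append(f"{i}. {step['tool']}({step['args']}) -> {step['result']}")
        return "\n".join(lines)
```

Feeding summary() back into each turn is one cheap way to keep the goal and the intermediate results in front of the model instead of relying on it to remember them.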

Tool selection is another pain point. When faced with 100+ available tools, agents default to the most generic ones or hallucinate tool names that don’t exist. The REST API endpoints are designed to be query-aligned, meaning the API name gives you a hint about what it does, but models still pick wrong endpoints or try to combine incompatible ones.

The Real Problem Isn’t Tool Calling

Here’s my take: these failures point to a deeper issue than bad tool calling. These models lack a persistent reasoning state. They treat each API call as an isolated event rather than part of a coherent plan. When a human does multi-step data work, they keep the goal in mind, track intermediate results, and adjust when something unexpected happens. Current agents don’t do that. They follow a rigid script and fall apart when reality doesn’t match their assumptions.

VAKRA is honest about this. The benchmark doesn’t let agents cheat by having perfect API documentation or simplified data. It forces them to deal with the same messy constraints enterprise developers face daily. The 128 tool limit isn’t an artificial constraint – it’s a real limitation of current LLM APIs that anyone building agent systems has to work around.

What This Means for Production

If you’re evaluating agents for enterprise use, VAKRA provides a more realistic stress test than most benchmarks. The combination of structured APIs, unstructured documents, and multi-step chains mirrors actual business workflows. The fact that models fail here suggests we’re still a ways from autonomous enterprise agents handling complex tasks without human oversight.

That doesn’t mean agents are useless. It means we need to design systems that account for these failure modes. Human-in-the-loop validation, checkpointing intermediate results, and limiting the tool selection scope to what the model can handle are all practical mitigations. VAKRA’s results also argue for more careful prompt engineering and possibly fine-tuning on tool-use traces before deploying to production.
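Checkpointing is the easiest of these to sketch. The example below persists the result of each step to a JSON file so an interrupted or failed chain can resume where it stopped rather than starting over. The run_with_checkpoints function and the file layout are assumptions for illustration, and it only works for JSON-serializable intermediate results.

```python
import json
from pathlib import Path
from typing import Any, Callable


def run_with_checkpoints(
    steps: list[Callable[[Any], Any]], path: Path, initial: Any = None
) -> Any:
    """Run steps in order, saving after each one; resume from the file if present."""
    state = {"next_step": 0, "value": initial}
    if path.exists():
        state = json.loads(path.read_text())

    for i in range(state["next_step"], len(steps)):
        # Each step receives the previous step's result.
        state["value"] = steps[i](state["value"])
        state["next_step"] = i + 1
        path.write_text(json.dumps(state))

    return state["value"]
```

The same checkpoint file is a natural place to hang a human review gate: pause after a risky step, let someone inspect the saved value, and resume only once it is approved.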

The dataset and leaderboard are publicly available if you want to test your own systems. It’s worth running your agents through this before claiming they’re production-ready. The failure modes VAKRA exposes are the same ones that will bite you in real deployments, and it’s better to find them in a benchmark than when a customer is waiting for results.
