QIMMA: The Arabic LLM Leaderboard That Actually Checks Its Homework

If you’ve been following Arabic LLM evaluation for a while, you’ve probably noticed the same thing I have: benchmarks and leaderboards keep popping up, but nobody seems to be asking whether the underlying data is any good.

That’s the problem QIMMA (قمّة, Arabic for “summit”) sets out to solve. Instead of just throwing models at existing benchmarks and publishing scores, the team behind it — researchers from TII — built a quality validation pipeline that runs before any evaluation happens. And what they found should make anyone working on Arabic NLP a little uncomfortable.

The Fragmented State of Arabic NLP Evaluation

Arabic has over 400 million native speakers across dozens of countries and dialects. You’d think the evaluation infrastructure would reflect that diversity. It doesn’t.

A lot of Arabic benchmarks are just translations from English. That means the questions were originally written for an English-speaking, Western cultural context, then awkwardly mapped onto Arabic. Even the “native” Arabic benchmarks often ship with annotation inconsistencies, wrong gold answers, encoding bugs, and cultural blind spots.

On top of that, most evaluation scripts and per-sample outputs are never released publicly. So if you want to audit someone’s results or build on their work, you’re out of luck.

Here’s where QIMMA sits relative to the existing crowd:

OALL v1/v2: Open source, but mixed native/translated content, no quality validation
BALSAM: Partially open, 50% native, no validation
AraGen: Fully native, but no validation, no code eval
SILMA ABL: Fully native, has validation, but no code eval
ILMAAM: Fully native, no validation
HELM Arabic: Mixed content, no validation
QIMMA: Open source, 99% native Arabic, full quality validation, code evaluation, public outputs

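QIMMA is the only one that checks all five boxes: open source, predominantly native Arabic content, quality validation, code evaluation, and public per-sample outputs.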

What’s Actually Inside

The platform consolidates 109 subsets from 14 source benchmarks into a unified evaluation suite of over 52,000 samples across seven domains:

Cultural: AraDiCE-Culture, ArabCulture, PalmX
STEM: ArabicMMLU, GAT, 3LM STEM
Legal: ArabLegalQA, MizanQA
Medical: MedArabiQ, MedAraBench
Safety: AraTrust
Poetry & Literature: FannOrFlop
Coding: 3LM HumanEval+, 3LM MBPP+

99% of the content is native Arabic. The only exception is the code evaluation tasks: the code itself is language-agnostic by nature, though the problem statements are written in Arabic. And this is the first Arabic leaderboard to include code evaluation at all.

The Quality Validation Pipeline: Where the Real Work Happens

This is the part I find most impressive. Before any model gets evaluated, every single sample goes through a multi-stage validation process.

Stage 1: Multi-model automated assessment. Two strong LLMs, Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B, independently score each sample against a rubric of 10 criteria. Each criterion gets a binary 0 or 1, so every sample ends up with a total out of 10 from each model. A sample is eliminated if both models score it below 7/10. If only one model flags it, it moves to human review.

The choice of models is smart: they’re both strong in Arabic but trained on different data, so their combined judgment is more robust than either alone.
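To make the Stage 1 rule concrete, here is a minimal sketch of how I read it. This is not QIMMA's actual code: the names `Verdict`, `rubric_score`, and `triage` are hypothetical, and keeping a sample that neither model flags is my assumption, since the description only spells out the other two cases.

```python
from enum import Enum

# From the description: a model "flags" a sample when it scores it below 7/10.
PASS_THRESHOLD = 7
NUM_CRITERIA = 10


class Verdict(Enum):
    KEEP = "keep"
    HUMAN_REVIEW = "human_review"
    ELIMINATE = "eliminate"


def rubric_score(criteria: list[int]) -> int:
    """Sum ten binary criterion scores into a total between 0 and 10."""
    if len(criteria) != NUM_CRITERIA or any(c not in (0, 1) for c in criteria):
        raise ValueError("expected exactly 10 binary (0 or 1) criterion scores")
    return sum(criteria)


def triage(score_a: int, score_b: int) -> Verdict:
    """Combine the two judges' totals into a single decision."""
    flags = sum(score < PASS_THRESHOLD for score in (score_a, score_b))
    if flags == 2:
        return Verdict.ELIMINATE      # both judges scored the sample below 7
    if flags == 1:
        return Verdict.HUMAN_REVIEW   # the judges disagree, so a human decides
    return Verdict.KEEP               # assumption: unflagged samples are kept


if __name__ == "__main__":
    judge_a = rubric_score([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])  # 8/10
    judge_b = rubric_score([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])  # 6/10
    print(triage(judge_a, judge_b))
```

The useful property here is that disagreement between the two judges is never resolved automatically. It always lands on a human's desk, which is what feeds Stage 2.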

Stage 2: Human annotation and review. Flagged samples are reviewed by native Arabic speakers who understand cultural and dialectal nuances. They make the final call on things like regional variation, subjective interpretation, and subtle quality issues that automated systems miss. For culturally sensitive content, multiple perspectives are considered, because “correctness” can genuinely vary across Arab regions.

What They Found: Systematic Quality Problems

The pipeline uncovered problems that aren’t just isolated errors — they’re systematic issues running through multiple benchmarks. Things like:

Translation artifacts that make questions nonsensical in Arabic
Culturally irrelevant or offensive content in ground-truth labels
Encoding errors that silently corrupt evaluation results
Annotation inconsistencies where the same question type gets different treatment across subsets

These aren’t minor edge cases. They’re the kind of problems that can quietly inflate or deflate model scores, making leaderboard rankings unreliable.
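Encoding errors are the easiest of these to illustrate in code. The sketch below is my own illustration, not something taken from QIMMA's pipeline, and the function names are hypothetical. It checks a sample for two common symptoms: the Unicode replacement character (U+FFFD), and mojibake produced when UTF-8 bytes were decoded as Latin-1 somewhere upstream.

```python
REPLACEMENT_CHAR = "\ufffd"  # inserted by decoders when bytes could not be decoded


def looks_like_mojibake(text: str) -> bool:
    """Heuristic: text that survives a Latin-1 -> UTF-8 round trip and
    comes out different was probably UTF-8 misread as Latin-1."""
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Real Arabic text cannot be encoded as Latin-1, and genuine Latin-1
        # text is usually not valid UTF-8, so both land here.
        return False
    return repaired != text


def encoding_issues(text: str) -> list[str]:
    """Return the names of the encoding symptoms found in one sample."""
    issues = []
    if REPLACEMENT_CHAR in text:
        issues.append("replacement_character")
    if looks_like_mojibake(text):
        issues.append("possible_mojibake")
    return issues


if __name__ == "__main__":
    clean = "ما عاصمة مصر؟"
    garbled = clean.encode("utf-8").decode("latin-1")  # simulate the bug
    print(encoding_issues(clean))
    print(encoding_issues(garbled))
```

This only catches the Latin-1 flavor of the problem. Text garbled through other code pages, such as Windows-1252 or Windows-1256, would need additional checks.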

What This Means for Arabic LLM Development

QIMMA is a much-needed reality check for the field. It’s easy to celebrate progress when you’re comparing models on benchmarks that nobody has bothered to validate. But if the data is broken, the scores don’t mean much.

The fact that QIMMA is fully open source, publishes per-sample outputs, and includes code evaluation makes it a genuinely useful resource. It’s not just another leaderboard — it’s a methodological contribution that raises the bar for how Arabic NLP evaluation should be done.

That said, I’d like to see this kind of validation pipeline applied to more languages and domains. The problems QIMMA found in Arabic benchmarks almost certainly exist in other languages too. We just haven’t been looking hard enough.