IBM just dropped Granite 4.1, a family of dense, decoder-only LLMs in 3B, 8B, and 30B sizes, all under Apache 2.0. The numbers are interesting: the models are trained on ~15 trillion tokens and support a context window of up to 512K tokens. But what caught my eye is the 8B instruct model matching or beating the previous Granite 4.0-H-Small, which was a 32B-A9B mixture-of-experts architecture (32B total parameters, about 9B active per token). That’s a dense 8B model keeping pace with an MoE that has four times as many total parameters. How?
They’re not shy about the recipe: five-phase pre-training, supervised fine-tuning on ~4.1M curated samples, and reinforcement learning using on-policy GRPO with DAPO loss (from Yu et al., 2025). Let’s walk through what that actually means.
The Architecture
Granite 4.1 sticks with a decoder-only dense transformer. Nothing wild here—Grouped Query Attention (GQA), Rotary Position Embeddings (RoPE), SwiGLU activations, RMSNorm, and shared input/output embeddings. The three sizes share the same training pipeline and data strategy, just scaled differently.
| Component | 3B Dense | 8B Dense | 30B Dense |
|---|---|---|---|
| Embedding size | 2560 | 4096 | 4096 |
| Number of layers | 40 | 40 | 64 |
| Attention head size | 64 | 128 | 128 |
| Number of attention heads | 40 | 32 | 32 |
| Number of KV heads | 8 | 8 | 8 |
| MLP hidden size | 8192 | 12800 | 32768 |
| MLP activation | SwiGLU | SwiGLU | SwiGLU |
| Position embedding | RoPE | RoPE | RoPE |
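As a sanity check, the table's dimensions can be turned into a rough weight count. This is a back-of-the-envelope sketch, not IBM's accounting: it assumes no bias terms, assumes SwiGLU uses three projection matrices (gate, up, down), and leaves out embeddings and norm weights because the article does not give the vocabulary size. The function and dictionary names are my own.

```python
# Rough attention + MLP weight count per model, using the table's dimensions.
# Assumptions: no biases, SwiGLU = 3 matrices, embeddings and norms excluded.

CONFIGS = {
    "3B": dict(d_model=2560, layers=40, head_dim=64, n_heads=40, n_kv_heads=8, d_ff=8192),
    "8B": dict(d_model=4096, layers=40, head_dim=128, n_heads=32, n_kv_heads=8, d_ff=12800),
    "30B": dict(d_model=4096, layers=64, head_dim=128, n_heads=32, n_kv_heads=8, d_ff=32768),
}


def transformer_block_params(d_model, layers, head_dim, n_heads, n_kv_heads, d_ff):
    """Count weights in the attention and MLP blocks across all layers."""
    q_and_o = 2 * d_model * n_heads * head_dim      # query and output projections
    k_and_v = 2 * d_model * n_kv_heads * head_dim   # GQA: fewer key/value heads
    mlp = 3 * d_model * d_ff                        # gate, up, and down projections
    return layers * (q_and_o + k_and_v + mlp)


for name, cfg in CONFIGS.items():
    billions = transformer_block_params(**cfg) / 1e9
    print(f"{name}: ~{billions:.2f}B weights in attention + MLP")
```

Under these assumptions the three configurations come out at roughly 3.15B, 7.97B, and 28.45B, which lines up with the advertised sizes once embeddings are added on top.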
Pre-Training: Five Phases, One Goal
This is where the real work is. They split pre-training into five phases, each with a different data mix and learning rate schedule. The idea is to start broad, then progressively focus on higher-quality, more domain-specific data.
Phase 1: General Pre-Training (10T tokens)
First up, broad language understanding. The data mix is heavy on web content:
- CommonCrawl ~59%
- Code ~20%
- Math ~7%
- Technical ~10.5%
- Multilingual ~2%
- Domain Specific ~1.5%
Power learning rate schedule with warmup. Nothing fancy, just getting the model to understand language at scale.
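The schedule shapes named in this walkthrough (power with warmup here, exponential and linear decay in the later phases) are easy to picture in code. The sketch below is generic: the functional forms, the peak rate, the warmup length, and the exponent are all placeholder assumptions of mine, since the article does not publish the actual hyperparameters. Only the 1e-4 starting value is taken from the long-context phase described later.

```python
# Generic learning-rate schedule shapes. All hyperparameters are illustrative
# assumptions, not values reported for Granite 4.1.


def power_with_warmup(step, peak_lr=3e-4, warmup_steps=1000, exponent=0.5):
    """Linear warmup to peak_lr, then power-law decay in the step count."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr * (warmup_steps / (step + 1)) ** exponent


def exponential_decay(step, total_steps, start_lr=1e-4, end_lr=1e-6):
    """Geometric interpolation from start_lr down to end_lr."""
    frac = min(step / total_steps, 1.0)
    return start_lr * (end_lr / start_lr) ** frac


def linear_to_zero(step, total_steps, start_lr=1e-4):
    """Straight-line decay that reaches exactly zero at total_steps."""
    return start_lr * max(0.0, 1.0 - step / total_steps)


for step in (0, 999, 10_000, 100_000):
    print(step, f"{power_with_warmup(step):.2e}")
```

The practical difference is the tail: a power-law schedule decays slowly and never reaches zero, which suits a long general phase, while the linear variant forces the rate to zero by the end of a short refinement phase.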
Phase 2: Math/Code Pre-Training (2T tokens)
They crank up the reasoning data. Math goes from 7% to 35% (a 5x increase), code from 20% to 30% (1.5x). CommonCrawl drops to 12% and they start introducing synthetic data at 9%.
Phase 3: High-Quality Data Annealing (2T tokens)
This is mid-training. They switch to an exponential decay learning rate and start blending in chain-of-thought and instruction data. The mix becomes more balanced:
- CommonCrawl-HQ ~16.67%
- Math ~16.67%
- Code ~16.67%
- Synthetic ~8.5%
- Technical ~12.5%
- Multilingual ~4.5%
- Long Chain-of-Thought ~12.5%
- Language Instructions ~7.5%
- Code Instructions ~4.5%
Phase 4: High-Quality Data Annealing — Refinement (0.5T tokens)
Linear learning rate decay to zero. They focus on the highest-quality data:
- CommonCrawl-HQ ~40%
- Code ~20%
- Math ~20%
- Long Chain-of-Thought ~6%
- Code Instructions ~5%
- Language Instructions ~9%
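To see what the shifting mixes add up to, the phase sizes and percentages quoted above can be multiplied out into token counts per category. This is only arithmetic on the article's numbers. Phase 2 lists four categories that sum to 86%, so the remaining 14% is left unassigned here rather than guessed, and the category names are kept exactly as listed (CommonCrawl and CommonCrawl-HQ stay separate).

```python
# Token budget per category across phases 1-4, in trillions of tokens.
# Percentages are the approximate shares quoted in the article.

PHASES = [
    ("Phase 1", 10.0, {"CommonCrawl": 59, "Code": 20, "Math": 7, "Technical": 10.5,
                       "Multilingual": 2, "Domain Specific": 1.5}),
    # Only these four shares are given for Phase 2; the other 14% is unspecified.
    ("Phase 2", 2.0, {"Math": 35, "Code": 30, "CommonCrawl": 12, "Synthetic": 9}),
    ("Phase 3", 2.0, {"CommonCrawl-HQ": 16.67, "Math": 16.67, "Code": 16.67,
                      "Synthetic": 8.5, "Technical": 12.5, "Multilingual": 4.5,
                      "Long Chain-of-Thought": 12.5, "Language Instructions": 7.5,
                      "Code Instructions": 4.5}),
    ("Phase 4", 0.5, {"CommonCrawl-HQ": 40, "Code": 20, "Math": 20,
                      "Long Chain-of-Thought": 6, "Code Instructions": 5,
                      "Language Instructions": 9}),
]


def tokens_by_category(phases):
    """Sum phase_size * share for every category, in trillions of tokens."""
    totals = {}
    for _name, size_trillions, mix in phases:
        for category, percent in mix.items():
            totals[category] = totals.get(category, 0.0) + size_trillions * percent / 100
    return totals


totals = tokens_by_category(PHASES)
for category, amount in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{category:<24}{amount:.3f}T")
print("Total across phases 1-4:", sum(size for _n, size, _m in PHASES), "T")
```

Phases 1 through 4 total 14.5T tokens, consistent with the ~15T headline figure. Math ends up at about 1.83T and code at about 3.03T, with Phase 2 contributing as many math tokens as all of Phase 1 despite being a fifth of its size.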
Phase 5: Long Context Training (LCE)
This is where they extend the context window from 4K to 512K tokens, in stages: 32K, then 128K, then 512K. For the 512K stage, the data mix is 80% books and 20% code repository data (only for 8B and 30B models). They use an exponential learning rate schedule starting at 1e-4 and do a model merge after each stage to avoid degrading short-context performance.
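The article does not say which merge method is used after each stage. One common and simple option is linear interpolation between the checkpoint from before the stage and the one from after it, so here is a minimal sketch of that idea. The function name, the `weight` parameter, and the toy checkpoints (plain dicts of float lists) are all hypothetical stand-ins, not IBM's procedure.

```python
# Linear weight interpolation between two checkpoints with identical shapes.
# This illustrates the general idea of a model merge; the actual method used
# for Granite 4.1 is not described in the article.


def merge_checkpoints(base, extended, weight=0.5):
    """Return (1 - weight) * base + weight * extended for every parameter."""
    if base.keys() != extended.keys():
        raise ValueError("checkpoints must have identical parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(extended[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1 - weight) * b + weight * e
            for b, e in zip(base[name], extended[name])
        ]
    return merged


short_ctx = {"layer0.w": [0.0, 2.0]}   # toy checkpoint before the stage
long_ctx = {"layer0.w": [1.0, 4.0]}    # toy checkpoint after the stage
print(merge_checkpoints(short_ctx, long_ctx, weight=0.5))
```

The intuition is that the merged weights stay close to the short-context solution while still picking up most of what the long-context stage learned.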
RULER benchmark results for base models:
| Model name | 32K | 64K | 128K |
|---|---|---|---|
| granite-4.1-3b-base | 75.0 | 66.6 | 58.0 |
| granite-4.1-8b-base | 83.6 | 79.1 | 73.0 |
The 8B model holds up well: it loses about 10 points going from 32K to 128K (83.6 to 73.0), while the 3B model loses 17 (75.0 to 58.0).
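Long contexts are not free at inference time, and the architecture table gives enough to estimate the key/value cache size. The sketch below assumes a 16-bit cache, a single sequence, and that "K" means 1024 tokens. Those are my assumptions, as is the function name.

```python
# KV cache size for one sequence: keys and values are stored for every layer,
# every KV head, and every token. Assumes 2 bytes per value (fp16/bf16).

GIB = 1024 ** 3


def kv_cache_bytes(layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    return 2 * layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


for label, tokens in (("32K", 32 * 1024), ("128K", 128 * 1024), ("512K", 512 * 1024)):
    size_3b = kv_cache_bytes(layers=40, n_kv_heads=8, head_dim=64, context_tokens=tokens)
    size_8b = kv_cache_bytes(layers=40, n_kv_heads=8, head_dim=128, context_tokens=tokens)
    print(f"{label}: 3B ~{size_3b / GIB:.0f} GiB, 8B ~{size_8b / GIB:.0f} GiB")
```

For the 8B model this works out to about 20 GiB at 128K and 80 GiB at 512K, on top of the weights. With 32 KV heads instead of 8 the cache would be four times larger, which is the practical reason GQA matters at these lengths.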
Supervised Fine-Tuning
After pre-training, they do SFT on ~4.1M high-quality samples. The key here is data curation using an LLM-as-Judge framework. They’re not just throwing data at the model—they’re filtering for quality. The SFT data covers chat, instruction following, math, coding, and reasoning.
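The article does not describe the judge prompt, scoring scale, or threshold, so the following is only a shape-of-the-idea sketch of LLM-as-Judge filtering. The `judge` callable, the 1-to-5 scale, the `min_score` cutoff, and the sample field names are all hypothetical. In a real pipeline `judge` would call a model, whereas here a toy stand-in keeps the example runnable.

```python
# Filter SFT samples by a judge score. The judge is passed in as a function so
# any scoring model could be plugged in; everything here is illustrative.


def filter_by_judge(samples, judge, min_score=4.0):
    """Keep samples whose judge score is at least min_score."""
    kept = []
    for sample in samples:
        score = judge(sample["prompt"], sample["response"])
        if score >= min_score:
            kept.append({**sample, "judge_score": score})
    return kept


def toy_judge(prompt, response):
    """Stand-in for an LLM call: rewards non-empty answers that show some work."""
    if not response.strip():
        return 1.0
    return 5.0 if len(response.split()) >= 5 else 3.0


samples = [
    {"prompt": "What is 2 + 2?", "response": "2 + 2 equals 4 because 2 + 1 + 1 = 4."},
    {"prompt": "What is 2 + 2?", "response": "4"},
    {"prompt": "Explain RoPE.", "response": ""},
]
print(filter_by_judge(samples, toy_judge))
```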
Reinforcement Learning
The last stage is reinforcement learning. They use on-policy GRPO with the DAPO loss (Yu et al., 2025) in a multi-stage pipeline that targets math, coding, instruction following, and general chat. The idea is to systematically strengthen each capability without sacrificing the others.
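For readers who have not met these acronyms, here is a minimal sketch of the two ingredients as I understand them from the GRPO and DAPO papers: advantages computed relative to a group of sampled responses, and a clipped objective with a wider upper clip range that is averaged over tokens rather than over sequences. The clip values 0.2 and 0.28 are the DAPO paper's defaults. Whether Granite 4.1 uses them is not stated, and the function names are mine.

```python
import math

# GRPO-style group advantages plus a DAPO-style token-level clipped loss.
# Pure-Python illustration; hyperparameters are not Granite 4.1's.


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its sampling group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def has_learning_signal(rewards):
    """DAPO's dynamic sampling skips groups where every reward is identical."""
    return max(rewards) != min(rewards)


def dapo_token_loss(logp_new, logp_old, advantages, clip_low=0.2, clip_high=0.28):
    """Clipped surrogate loss averaged over all tokens in the group."""
    total, n_tokens = 0.0, 0
    for new_seq, old_seq, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_seq, old_seq):
            ratio = math.exp(lp_new - lp_old)
            clipped = min(max(ratio, 1 - clip_low), 1 + clip_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    return -total / n_tokens


rewards = [1.0, 0.0]                      # one correct, one incorrect response
logps = [[-0.5], [-0.7, -0.2, -0.9]]      # per-token log-probs, lengths 1 and 3
if has_learning_signal(rewards):
    adv = group_relative_advantages(rewards)
    # On-policy: the sampling policy equals the current policy, so ratio = 1.
    print(dapo_token_loss(logps, logps, adv))
```

Because the loss is normalized per token, the longer response in the toy example contributes three times as many terms as the shorter one, which is the behaviour the DAPO paper argues for when responses are long chains of thought.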
What I Think
I’ve been watching the small model space for a while, and Granite 4.1 is a solid entry. The 8B model beating a 32B MoE is a good sign that dense architectures still have room to grow when you put the work into data and training strategy. The five-phase pre-training is a smart approach—it’s not just about scaling compute, it’s about knowing when to shift the data mix.
The Apache 2.0 license is a nice touch. Too many models are locked behind restrictive licenses these days. IBM’s been doing good work in the open LLM space, and Granite 4.1 continues that trend.
That said, I’d like to see more benchmarks, especially on reasoning and coding tasks. The RULER numbers are good for long context, but I’m curious how the instruct models stack up against Llama 3 or Mistral on standard evaluations. The claim about matching Granite 4.0-H-Small is promising, but I’ll believe it when I run my own tests.
Overall, Granite 4.1 is worth a look if you’re in the market for a capable small model. The 8B version in particular seems like a sweet spot for deployment on consumer hardware or edge devices. The long context support up to 512K tokens is a nice bonus for document-heavy use cases.
Check out the model collection on Hugging Face or the GitHub repo for the full details.