The SMOL Playbook from Hugging Face: What It Actually Takes to Train a World-Class LLM (And Why You Probably Shouldn’t)
Hugging Face just dropped something rare in AI: the truth.
Their SMOL Training Playbook is a 200+ page document detailing what really happens when you train a large language model from scratch. Not the polished research paper version with clean ablations and obvious decisions. The actual version with 2am dataloader bugs, infrastructure breakdowns, and a complete restart after burning through 1 trillion tokens.
This is the story of building SmolLM3, their 3-billion parameter model trained on 11 trillion tokens. And it’s the most honest thing you’ll read about AI development this year.
The Question Nobody Asks: Should You Even Do This?
Here’s where Hugging Face starts, and where most teams should end: don’t train your own model unless you absolutely have to.
The uncomfortable truth is that with Llama, Qwen, Gemma, and dozens of other powerful open-source models available, most organizations have zero business training from scratch. Fine-tuning? Sure. Prompt optimization? Definitely. But pre-training on trillions of tokens? That’s a different beast.
You need a really specific reason. Not “AI is the future” or “we want to be cutting-edge.” Real reasons, like:
Narrow domain expertise - DNA sequencing, pharmaceutical research, legal documents where general models genuinely fall short
Security and privacy requirements - you need absolute guarantees about what data went into training and where inference happens
Industry regulations - compliance requirements that off-the-shelf models can’t satisfy
Unique architectural needs - your use case requires something genuinely novel that doesn’t exist yet
Notice what’s not on this list: wanting to “own your AI stack” or having some unused GPU clusters sitting around. Those are expensive ways to learn that Meta’s engineers already solved your problem.
The playbook is brutal about this. Try prompting first. Then try fine-tuning. Then try continued pre-training on your specific domain. Only if all of that fails, and you have millions of dollars to burn, should you consider training from scratch.
What It Actually Costs (Spoiler: More Than You Think)
When people quote DeepSeek’s training costs, they forget about the months of failed experiments, debugging, and research. The compute for the final training run is just half the story.
Hugging Face had to restart their entire 11 trillion token training run after already burning through 1 trillion tokens. That wasted compute never shows up in the headline training cost; it’s the research cost. That’s the price of learning.
Here’s what the bill looks like for SmolLM2: around $250,000 USD just for the compute, training on 11 trillion tokens. But that doesn’t include:
All the ablation experiments to figure out what architecture works
The infrastructure failures and debugging sessions
The multiple training runs that didn’t make it to the end
The team’s time across months of work
And that’s for a “small” model. Want to train something frontier-class? Add two zeros.
The playbook mentions something fascinating: fine-tuning an existing model on 1 trillion tokens costs less than training from scratch on 10 trillion. This is the math that should stop most training projects before they start.
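To make that math concrete, here’s a rough back-of-envelope sketch using the widely used 6 × parameters × tokens FLOPs approximation. The effective throughput and price per GPU-hour are illustrative assumptions, not figures from the playbook:

```python
# Back-of-envelope cost of a single clean pre-training run, using the
# common FLOPs ≈ 6 * params * tokens approximation. The effective GPU
# throughput and price per GPU-hour below are illustrative assumptions,
# not figures from the playbook.
def training_cost(params: float, tokens: float,
                  flops_per_gpu_sec: float = 4e14,   # ~40% MFU on a modern GPU (assumed)
                  usd_per_gpu_hour: float = 2.50):   # assumed rental price
    total_flops = 6 * params * tokens                # forward + backward pass
    gpu_hours = total_flops / flops_per_gpu_sec / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour

hours, usd = training_cost(params=3e9, tokens=11e12)  # a SmolLM3-sized run
print(f"~{hours:,.0f} GPU-hours, ~${usd:,.0f} for one clean pass")
```

And even this optimistic estimate assumes a single clean pass - no restarts, no failed ablations, no idle GPUs.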
The Bugs That Cost Millions
The most shocking part of the playbook isn’t the successful techniques - it’s the disasters.
The Tensor Parallelism Bug: Every GPU in a tensor-parallel group was being initialized with the same random seed, so the shards of each weight matrix started out identical rather than independently random - which crippled the model’s ability to learn effectively. The loss curve looked fine. The evaluations told a different story: the new 3B model was performing worse than its 1.7B predecessor.
They caught it because they’d systematically de-risked everything else through ablations. When performance tanked, they knew exactly where to look: the one component they hadn’t tested at scale.
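For a sense of what the fix looks like, here’s a minimal sketch of per-rank seeding, assuming a PyTorch torch.distributed setup. Full training frameworks manage separate RNG states for sharded versus replicated parameters; this only illustrates the core idea:

```python
# Minimal sketch of per-rank seeding under tensor parallelism, assuming a
# torch.distributed process group is already initialized.
import torch
import torch.distributed as dist

def seed_for_init(base_seed: int) -> None:
    # Offset the seed by the rank that owns each weight shard, so that
    # tensor-parallel shards of the same matrix are NOT filled with
    # identical "random" values.
    rank = dist.get_rank() if dist.is_initialized() else 0
    torch.manual_seed(base_seed + rank)
    torch.cuda.manual_seed_all(base_seed + rank)
```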
The Dataloader That Couldn’t: When you’re training on trillions of tokens, even small inefficiencies compound. Their new dataloader had an internal index that grew with the total number of training steps, causing slowdowns on long runs. Solution? Roll back to the old, battle-tested dataloader from SmolLM2.
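The failure mode is a familiar one; here’s a caricature in Python - hypothetical code, not the actual SmolLM3 dataloader:

```python
# Materializing an index that scales with the total number of training
# steps makes very long runs progressively slower and heavier.
def build_index_eagerly(total_steps: int, batch_size: int):
    # O(total_steps * batch_size) memory and startup work - fine for short
    # ablations, painful at multi-trillion-token scale.
    return [step * batch_size + i
            for step in range(total_steps)
            for i in range(batch_size)]

def batch_indices(step: int, batch_size: int, dataset_len: int):
    # Constant work per step, independent of how long the run is.
    start = (step * batch_size) % dataset_len
    return [(start + i) % dataset_len for i in range(batch_size)]
```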
The Storage System That Lied: Infrastructure that worked fine in small tests collapsed under the pressure of a multi-trillion token run. Storage couldn’t keep up. Throughput had periodic drops that looked like ghosts in the machine.
This is the part research papers skip. The “minor implementation details” that cost weeks of debugging and millions in wasted compute.
The Ablation Philosophy: Test Everything at Small Scale First
Here’s the methodology that actually works: run experiments at a small scale and get results you can confidently extrapolate to your final production run.
For SmolLM3, they ran ablations on a 1B model trained on 45B tokens - a mix of FineWeb-Edu, FineMath, and Python-Edu. Small enough to iterate quickly, large enough that the results actually transfer.
The rule is simple but strict: no architectural change gets into production unless an ablation proves it helps. This is how they avoided disaster 8 trillion tokens into training. This is how they knew where to look when things went wrong.
If something hurts performance at small scale, you can confidently rule it out for large scale. The inverse isn’t always true: something that works at 1B might still fail at 100B, but it’s a hell of a lot cheaper to find out early.
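A harness for this discipline doesn’t need to be fancy. Here’s a hypothetical sketch - every name in it is a placeholder, not playbook code: change one thing, train a small proxy, evaluate on the same suite, and keep only what beats the baseline.

```python
# Hypothetical ablation harness. BASELINE, CANDIDATES, train_small_proxy,
# and evaluate are placeholders, not playbook code. Each candidate change
# is trained at small proxy scale with everything else held fixed, then
# compared on the same evaluation suite.
BASELINE = {"attention": "gqa", "pos_encoding": "rope", "tie_embeddings": True}

CANDIDATES = [
    {"pos_encoding": "nope_every_4th_layer"},
    {"tie_embeddings": False},
]

def run_ablations(train_small_proxy, evaluate):
    scores = {"baseline": evaluate(train_small_proxy(BASELINE))}
    for change in CANDIDATES:
        config = {**BASELINE, **change}        # exactly one change at a time
        scores[repr(change)] = evaluate(train_small_proxy(config))
    # A change ships only if it beats the baseline here first.
    return scores
```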
The Kimi team, when developing their 1 trillion parameter model, ran ablations on a 3B MoE with 0.5B active parameters. Because using the full size for every test would have been financially insane.
The Data Reality: Quality Over Everything
Here’s something counterintuitive: adding arXiv papers to a small model’s pre-training mix actually drags performance down. Why? The style is too narrow, too specialized. It skews the model’s understanding of natural language.
This is what separates people who’ve actually trained models from people who’ve read papers about training models. Intuition is almost useless at this scale.
The playbook emphasizes data quality obsessively. Too much English content, and multilingual performance tanks. Too much code, and reasoning improves but general conversation suffers. Balance is everything, and finding that balance requires... more ablations.
Modern LLMs are trained on datasets like:
FineWeb-Edu (educational web content, heavily filtered)
FineMath (50B tokens of mathematical reasoning)
Stack-Edu (educational code in 15 programming languages)
Cosmopedia (synthetic textbooks and explanations)
Notice what’s missing: random web scrapes, unfiltered Common Crawl, everything that made earlier models weird and unreliable.
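As a rough illustration of how a mix like this gets assembled, here’s a sketch using the Hugging Face datasets library. The dataset IDs, config names, and mixing weights are assumptions for illustration; check the Hub for the actual SmolLM3 recipes:

```python
# Sketch of assembling a filtered pre-training mix with the `datasets`
# library. Dataset IDs, config names, and mixing weights are assumptions
# for illustration - check the Hugging Face Hub for the exact mixtures.
from datasets import load_dataset, interleave_datasets

fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
finemath = load_dataset("HuggingFaceTB/finemath", "finemath-4plus",
                        split="train", streaming=True)

# Weighted interleaving instead of dumping raw crawls together.
mix = interleave_datasets([fineweb_edu, finemath],
                          probabilities=[0.8, 0.2], seed=42)
```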
The Architecture Choices That Actually Matter
The playbook walks through every modern architectural decision:
Attention mechanisms: Multi-Head Attention (MHA) is basically obsolete for long-context inference. Grouped Query Attention (GQA) with 2-16 groups is now standard, and Hugging Face ran ablations across MQA, GQA, and MHA to optimize for inference memory (a minimal shape sketch follows this list).
Positional encodings: RoPE (Rotary Position Embedding) has become standard, though some newer models like Kimi experiment with “NoPE” (no positional encoding in certain layers).
Intra-document masking: When multiple documents are packed into one training sequence, tokens only attend to other tokens from the same source document (see the mask sketch below). SmolLM3 used this from day one; SmolLM2 added it late and paid for it.
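To see why fewer KV heads matter, here’s a minimal PyTorch shape sketch of grouped-query attention - illustrative only, not the SmolLM3 implementation. The KV cache, which dominates long-context inference memory, shrinks by the ratio of query heads to KV heads:

```python
# Minimal shape sketch of grouped-query attention (illustrative, not the
# SmolLM3 implementation). With fewer key/value heads than query heads,
# the KV cache shrinks by the same factor.
import torch
import torch.nn.functional as F

batch, seq, head_dim = 1, 1024, 128
n_q_heads, n_kv_heads = 16, 4                  # 4 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # cached at 1/4 the size of MHA
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand K/V so each group of query heads attends to its shared KV head.
group = n_q_heads // n_kv_heads
out = F.scaled_dot_product_attention(
    q,
    k.repeat_interleave(group, dim=1),
    v.repeat_interleave(group, dim=1),
)
print(out.shape)                               # torch.Size([1, 16, 1024, 128])
```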
Nothing is exotic for the sake of being exotic. Every choice is validated through ablations. Every innovation needs to prove its worth.
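The intra-document masking mentioned above is easier to picture in code. A tiny sketch, assuming documents are packed into one sequence and each token carries its source document id - illustrative, not the production kernel:

```python
# Intra-document attention mask for packed sequences: attention is allowed
# only within the same source document, so unrelated documents packed into
# one sequence can't leak into each other.
import torch

doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])   # 3 documents packed into 8 tokens

causal = torch.tril(torch.ones(8, 8, dtype=torch.bool))   # standard causal mask
same_doc = doc_ids[:, None] == doc_ids[None, :]           # same-document pairs
attn_mask = causal & same_doc                             # True = attention allowed
```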
What’s Missing From Research Papers
Published research makes everything look inevitable. Strategic architecture choices. Carefully curated datasets. Sufficient compute. The results are polished, the ablations are clean, every decision seems obvious in hindsight.
But those reports only show what worked, with a bit of rosy retrospection; they don’t capture the 2am dataloader debugging sessions, the loss spikes, or the subtle bugs that quietly sabotage your training.
The SMOL Playbook says the quiet part out loud: this is messy, iterative, and full of dead ends. Even Hugging Face’s expert team doesn’t have a perfect recipe. They have a disciplined, empirical process for navigating chaos.
The story reads like a drama. Promising small-scale ablations that don’t translate at scale. Infrastructure breakdowns at the worst possible moments. Bugs so subtle they took days to find and cost millions to fix.
Why This Matters
This playbook is valuable not because it tells you how to train a model - though it does - but because it shows you when not to.
It demystifies the art of LLM training. It proves that even world-class teams spend half their compute budget on failed experiments and debugging. It provides a concrete framework for making better decisions, faster.
For most companies building AI products, the lesson is clear: use existing open-source models. Fine-tune them. Optimize your prompts. Build great products on top of them.
But if you’re one of the few who genuinely needs to train from scratch - maybe you’re pushing into a domain where no good models exist, or building something that requires real architectural innovation - this is your roadmap.
Just know what you’re getting into. It’s not a clean process with obvious answers. It’s debugging at scale. It’s infrastructure that works perfectly until it doesn’t. It’s restarting after a trillion tokens because of a seed initialization bug.
It’s expensive, messy, and harder than it looks.
And Hugging Face just saved you millions of dollars by telling you that upfront.