Notes on Training Code LLMs
#ml #llm #training
I've spent the past year training code LLMs for Mellum; here are some scattered notes on what I've learned.
Data matters more than you think
Everyone says this, but it's really true. We spent way more time on data processing than on model architecture. Some things that helped:
- Deduplication — removing near-duplicates at multiple levels (file, function, line)
- Quality filtering — surprisingly hard to define "quality" for code
- FIM (Fill-in-the-Middle) — essential for code completion, but the format matters
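On the last point: a common layout is PSM (prefix, suffix, middle), where each file is split at random offsets and the model learns to generate the middle given the surrounding context. Here's a minimal sketch; the sentinel strings are illustrative placeholders, since real tokenizers define their own special tokens:
import random

FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"

def to_fim_example(code: str, rng: random.Random) -> str:
    # Split one training document into prefix / middle / suffix at random
    # character offsets and emit it in PSM order: the model sees the
    # surrounding context first, then generates the middle.
    if len(code) < 3:
        return code  # too short to split; train on it as-is
    lo, hi = sorted(rng.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:lo], code[lo:hi], code[hi:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
Details like where you split (characters, lines, syntax nodes) and what fraction of examples get the FIM treatment are worth ablating.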
Architecture trade-offs
When you're optimizing for serving latency and memory, you have to make choices:
# Simplified view of the trade-offs
attention_tradeoffs = {
    "GQA": "fewer KV heads = smaller cache",
    "MLA": "even smaller cache, but more complex",
    "Sliding window": "bounded memory, but limited context",
}
There's no free lunch. Everything is a trade-off between quality, speed, and memory.
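To make the memory side concrete, here's a back-of-the-envelope KV-cache calculation comparing full multi-head attention with GQA. The layer count, head dimension, and context length below are illustrative, not Mellum's actual config:
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Keys and values are both cached, per layer, per token: hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical config: 32 layers, head_dim 128, 8k context, batch 1, fp16 cache.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=8192, batch=1)
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192, batch=1)
print(f"MHA: {mha / 2**30:.2f} GiB, GQA with 8 KV heads: {gqa / 2**30:.2f} GiB")
Cutting 32 KV heads down to 8 shrinks the cache 4x in this toy config, which is the kind of saving that decides how many concurrent requests fit on one GPU.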
Things that break at scale
Distributed training on 512 GPUs is a different beast from training on 8:
- Network failures become common
- Gradient synchronization bugs are subtle
- Checkpointing takes forever and can fail (see the sketch after this list)
- Monitoring becomes critical
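On the checkpointing point, one cheap mitigation is to never write a checkpoint in place: write to a temporary file, then atomically rename, and retry on transient errors. A minimal single-process sketch (in a real multi-node job you'd save from one rank only, or use sharded or asynchronous checkpointing); the retry count and backoff here are arbitrary:
import os
import time
import torch

def save_checkpoint(state: dict, path: str, retries: int = 3) -> None:
    # Write to a sibling temp file first, then atomically swap it into place,
    # so a crash mid-write never leaves a corrupt checkpoint at `path`.
    tmp_path = path + ".tmp"
    for attempt in range(1, retries + 1):
        try:
            torch.save(state, tmp_path)
            os.replace(tmp_path, path)  # atomic rename on the same filesystem
            return
        except OSError as err:
            print(f"checkpoint attempt {attempt} failed: {err}")
            time.sleep(2 ** attempt)  # crude exponential backoff
    raise RuntimeError(f"could not write checkpoint to {path}")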
One more thing
If you're training models, invest in your evaluation suite early. It's the only way to know if you're making progress or just moving sideways.
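Even a tiny harness over a frozen set of (prompt, expected completion) pairs beats eyeballing samples. A minimal sketch using exact match, which is crude but enough to catch regressions; generate is a stand-in for whatever inference call you actually use:
from typing import Callable, Iterable

def exact_match_rate(cases: Iterable[tuple[str, str]],
                     generate: Callable[[str], str]) -> float:
    # cases: (prompt, expected_completion) pairs from a frozen eval set.
    hits = total = 0
    for prompt, expected in cases:
        hits += int(generate(prompt).strip() == expected.strip())
        total += 1
    return hits / max(total, 1)
Execution-based checks catch more, but the point is to run something consistently on every checkpoint.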
More detailed posts on specific topics coming soon.