Ai2 released OLMoE, which is probably our “best” model yet relative to its peers, but not much has changed in the process.


Today we’re releasing our best open-source language model to date, OLMoE, a Mixture-of-Experts (MoE) model with 1.3 billion active parameters and 6.9 billion total parameters. It was trained on 5 trillion tokens, largely composed of the DCLM baseline mix, and comes with many intermediate training checkpoints, the Apache 2.0 license for all model variants, an improved post-training mix, code, and training logs.
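For anyone who wants to poke at the model directly, the checkpoints load with the standard Hugging Face transformers API. A minimal sketch, assuming the release-time model id `allenai/OLMoE-1B-7B-0924`, a recent transformers version with OLMoE support, and `accelerate` installed for device placement:

```python
# Minimal sketch: load the OLMoE base model and generate text with transformers.
# The model id and generation settings here are illustrative, not a prescribed setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMoE-1B-7B-0924"  # base checkpoint; post-trained variants are released separately
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Mixture-of-experts models are efficient because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```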

I’ll quickly cover the high-level takeaways from the model, but most of this post is an update on my worldview around “how frontier model organizations work” and what the not-so-secret sauces are. Since we started the Open Language Models (OLMos) project, we’ve been working through the same trials and tribulations that the leading labs went through a few years earlier. The secret is that there are no quick tricks to get a better model, just many small wins.

First, some details on OLMoE because it’s a pretty impressive model! Here’s our take on one of those convex hull plots that have made the Twitter meme rounds many times. OLMoE is on the frontier with Qwen’s 3 billion active parameter MoE and the Llama 3.1 8B model.

[Figure: evaluation performance versus active parameters, with OLMoE on the efficiency frontier alongside Qwen’s MoE and Llama 3.1 8B.]

OLMoE outperforms all the open models in its active parameter range and is very similar to some popular MoE models with about twice the parameters.

[Figure: benchmark comparison of OLMoE against open models with similar and larger active parameter counts.]

Here, you can see that the post-training recipe makes the comparison to slightly bigger models even more favorable.

[Figure: evaluation results for the post-trained OLMoE against slightly larger models.]

Given how close the KTO and DPO numbers are, I suspect we can push the post-training recipes even further for this model. It’s crossing the Rubicon I’ve been waiting for — the point where small language models start to really respond to fine-tuning. For the last few years, we’ve only seen minor jumps.

[Figure: comparison of DPO and KTO post-training results for OLMoE.]
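For readers who haven’t looked closely at these objectives, the DPO loss behind one set of those numbers is simple to write down. This is a minimal sketch of the standard DPO objective on summed per-sequence log-probabilities, not our exact training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over a batch of preference pairs (chosen vs. rejected)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # how much the policy upweights the chosen response
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # ...and the rejected response
    # Push the margin between chosen and rejected log-ratios up, scaled by beta.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```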

The OLMoE paper has a ton of details on design decisions for MoE models — load balancing of experts, number of experts, regularization, and everything in between.
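To give a flavor of one of those decisions: MoE routers tend to collapse onto a few experts unless an auxiliary loss rewards spreading tokens evenly. Below is a minimal sketch of the common load-balancing loss used in most MoE codebases; the paper has the exact formulation and coefficients, and the 8-of-64 routing shown in the defaults matches OLMoE’s configuration.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int = 64, top_k: int = 8) -> torch.Tensor:
    """Auxiliary loss pushing the router toward uniform expert usage.

    router_logits: [num_tokens, num_experts] pre-softmax scores from one MoE layer.
    """
    probs = torch.softmax(router_logits, dim=-1)                     # router probabilities per token
    _, selected = torch.topk(probs, top_k, dim=-1)                   # experts each token is routed to
    dispatch = F.one_hot(selected, num_experts).float().sum(dim=1)   # [num_tokens, num_experts] 0/1 routing mask
    tokens_per_expert = dispatch.mean(dim=0)                         # fraction of routing slots each expert receives
    prob_per_expert = probs.mean(dim=0)                              # average router probability mass per expert
    # The product is minimized when both quantities are uniform across experts.
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)
```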

This release is the right time to reflect on what I have been learning about running a successful foundation model training group. I cover compute allocations, de-risking training, organizational complexity, and compounding improvements below.


Frontier model team compute allocations

I’ve heard from many friends who have gone to top language model labs that their work can be seen as somewhat routine, with simple tasks, yet with an incredible pace. There are a lot of people who are just grinding on making datasets all the time and a smaller percentage who are launching the biggest training runs. There are many, many pieces that go into making a Pareto-expanding model, so this is a noisy sample. Still, some high-level rules have emerged.

All of the top labs have undergone substantial internal politics over compute allocations. The biggest allocation goes to the fewest people — pretraining researchers. This ends up being about 60%, or some number greater than half. The full distribution is something like pretraining — 60%, post-training — 25%, data — 10%, and other — 5%. We’ll use these numbers as a ballpark for the rest of the post — specifics vary. It’s well known what the pretraining allocation goes to — figuring out the best architectures and data mixes for scalable state-of-the-art models. The post-training and data allotments are evolving fast.

Post-training at 25% is very high relative to the original landscape of ChatGPT. With the modern RLHF workflow, a ton of compute is spent on the generation and filtering of fine-tuning data. For example, at a small scale, it takes on the order of a day or two to train a 7 billion parameter model on about 500 thousand instruction samples on 1-2 Nvidia 8xH100 nodes (on a not-that-optimized cluster).

When doing something like rejection sampling or another online method, it is normal to generate 10 to 30 or more samples per prompt. Using a bigger model, such as Llama 405B, for those generations to “distill” into your candidate model will take substantially longer than training on the filtered outputs. Filtering takes time too.
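As a concrete picture of where that compute goes, here is a minimal sketch of best-of-n rejection sampling for building fine-tuning data. The `generate` and `reward` callables are placeholders for whatever inference stack and reward model you use; the sample count is illustrative, not our recipe.

```python
from typing import Callable, Dict, List

def best_of_n(prompts: List[str],
              generate: Callable[[str, int], List[str]],
              reward: Callable[[str, str], float],
              n: int = 16) -> List[Dict]:
    """For each prompt, sample n completions from a (usually bigger) model and keep the best-scoring one."""
    dataset = []
    for prompt in prompts:
        completions = generate(prompt, n)                        # n generations per prompt: the expensive step
        scored = [(reward(prompt, c), c) for c in completions]   # score each completion with a reward model
        best_score, best_completion = max(scored)                # filter: keep only the top sample
        dataset.append({"prompt": prompt, "completion": best_completion, "score": best_score})
    return dataset
```

Run over hundreds of thousands of prompts with n in the tens, the generation step alone dwarfs the supervised fine-tuning run that follows, which is how post-training ends up with such a large slice of the compute budget.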