Training Olmo 3 to Think and Code
Olmo 3 was officially released today! I’m super excited about to share what I’ve been working on these past few months at Ai2. While I’ve juggled a lot of things to get this model out the door, the main thing I lead was post-training this model to write code.
Post-training for code comes in two main forms: supervised fine-tuning (SFT) and reinforcement learning (RL)1
For SFT we train on pairs of (coding problem, Python solution) using the next-token prediction objective. This is very similar to pre and mid training, except for some hyper parameters like the learning rate, and that we apply a chat template so the model learns the chat format.
For RL we train on pairs of (coding problem, test cases). We have the model generate a Python program to solve the coding problem, then execute the test cases against its solution to produce a reward for the model. We take that reward and apply a policy gradient algorithm to reinforce tokens that earn large rewards and discourage tokens that earn small rewards. You can read more about these algorithms in my manager Nathan’s RL book. We run a variation of GRPO, adding tricks like a higher max-clip value from DAPO, active sampling, and in flight updates á la pipelineRL for stability and efficiency, respectively. You can read more about those in the RL section of the Olmo 3 paper. I’ll probably write another blog on RL infra at some point too.

To train a language model for a new capability you need a few things:
Evaluation Suite → a way to measure progress apart from training metrics
Training Data → For SFT we need pairs of (coding problem, Python solution). For RL we need pairs of (coding problem, test cases).
Infrastructure → both training infra and the RL environment. In the case of code this was a scaleable setup to execute test cases against a Python program
I’ll go through each piece blow.
Eval Suite
The dev eval suite I ended up using consists of 3 code generation evals: HumanEval, MBPP, and LiveCodeBench. HumanEval and MBPP are pretty similar tasks: complete a Python function such that the provided test cases pass. LiveCodeBench has these sorts of function completion problems as well as problems that require reading from and writing to the console using the standard I/O libraries.
We also had a number of unseen evals that we didn’t look at or eval our models on during development. Just like classic dev/test split, a large gap between the dev set and unseen eval set suggests overfitting to the dev evals.
Training Data
My coworker Valentina gave me some good advice when I started thinking about training data. It was something like “all you really need are prompts. Once you have good prompts, you can just generate completions or anything else you need from other language models.”
This really only applies when you’re training for something that existing models are already good at. Luckily, this is definitely true for simple, single-turn code generation.
Since lots of people are working on coding LM’s right now, finding prompts wasn’t too hard. To generate SFT and RL data, I came up with the following synthetic data procedure:
given a prompt, have GPT 4.1 generate the following:
a rewrite of the problem that clearly specifies the function signature
a Python solution with that function signature
a list of test cases
Execute the test cases against the solution. Choose a pass threshold (I went with 80%) then filter via the following rule:
if less than 80% of generated test cases passed, throw out the entire row → assume the problem is poorly phrased or the Python solution is incorrect
if more than 80% of test cases passed, keep the row and throw out only the failing test cases → assume the failing test cases are buggy but the problem, solution and passing test cases are good
I ran this over a lot of prompts from various open datasets. I ended up with 187k rows that included (prompt, Python solution, test cases) that I could use for SFT or RL as I pleased.
For the thinking model, I ended up generating multiple (16) completions to each of these prompts from the thinking model Qwen QwQ as thinking SFT data.
Execution Infrastructure
To train with RL or filter data via the above procedure we needed a way to execute test cases against a solution. I worked hard on this with my then-coworker Costa. I explored running a cloud function (i.e. AWS Lambda) and Costa explored spinning up a local execution server.
While both solutions seemed to work just fine, Costa’s solution was free and generally lower latency once he had the idea to spin up a huge number of workers for the server and put up a nginx load balancer in front of it.
This worked great for a while! But then I ran into an interesting issue, more about this in the Bonus Story below.
In the end we had two good solutions for execution, and we ended up using the local server for quick experiments that is supported, while we used the cloud function for bigger runs with larger test case suites and generally heavier loads.
Takeaways
When training thinking or tool-use models with long trajectories, training many on many completions to the same prompt is often beneficial. Surprisingly, filtering out “incorrect” completions (in the case of code, programs which fail test cases) can often hurt you. The diversity of reasoning tokens ends up being more important than only training on traces that lead to “correct” answers
this is not the case for non-thinking data where the answer is output directly. This more so matches intuition
another explanation is this sort of rejection sampling implicitly filters against harder data!
I ran a lot of data ablations. It’s important to follow the empirical results over “I don’t think this data looks good”
For RL, the difficulty of the data needs to match the starting point of the model. If the data is too easy or too hard, you’ll waste a ton of compute generating rollouts that result in 0 gradient
for this reason we do a lot of difficulty filtering on our data before we start RL. I.e, generate 8 completions for each data point in the set and throw out rows where the model earns full reward on 6 or more completions. You could imagine filtering out rows where the model earns 0 reward as well, but we keep these in hopes that the model will pick these examples up during training
another implication of this is none of my work would’ve been possible without our awesome pre-training team doing their thing. Thanks, pretrainers!
Building training infra that supports a thinking model is really, really hard. Those long generations make all sorts of things very difficult in practice. Shoutout to the people at Ai2 most involved with this: Finbarr, Hamish and Michael.
Formatting the final answer properly can be a pretty big source of weirdness in model eval scores. As a result, making sure your eval setup makes it very clear what format to use (e.g. use \boxed{} or not for math, place final code in ```python``` blocks) is super important! And make sure to train on that same format! This was especially important foe the RLZero models trained. Since they have no post-training and are RL’d directly from the base model, they struggle to follow formatting instructions in the prompt
What’s next
Working on this model was a ton of fun and super educating. I hope to continue improving code capabilities in our models, including using code as a tool in a reasoning chain and training coding agents (do well on SWE-bench). I might also try to setup an RL loop to teach our models to write performant low-level code, e.g. cuda kernels. I would also love to learn even more about building efficient and stable RL infra and make open-instruct a really great piece of software for doing RL at Ai2.
Outside of code, I want to start working on search-and-verify knowledge discovery problems where we can smoothly scale simulation as I described in my last blog post. I’m especially excited about protein design. It’d be cool to fuse the knowledge base of a natural language LM like Olmo with the knowledge base of a protein model like AlphaFold or ESM. You could imagine a simple RL loop where the LM is prompted to propose a protein sequence that exhibits a specific function or structure. You could then run a protein model to verify that the proposed protein exhibits this function or structure. Scaling simulation here is straightforward; just drop-in a bigger, more powerful protein model!
I’m excited for the future and all the cool stuff we’ll train into Olmo 4!
Bonus Story: Verification Ceilings and Base Model Limitations
What follows is an in-the-weeds story about my struggles hill-climbing LiveCodeBench with RL:
The synthetic pipeline I described above worked great for pushing numbers up on HumanEval and MBPP, but I was really struggling with saturating LiveCodeBench via RL.
Looking at a a bunch of completions (a.k.a. rollouts) from my RL-trained models I realized that the models were learning to generate brute-force solutions to the problem but generally failed to generate solutions that used more complicated algorithms with better time-complexities.
This made sense to me: since the test cases I was using were LM generated, they were all very “short”. That is, they weren’t stress-testing the code written the way the test-case suites from LiveCodeBench do. If your code doesn’t have a good time complexity, you’re going to timeout test cases in LiveCodeBench and fail the problem. Since my synthetic test case suites lacked these sorts of test cases, the model was unable to learn. This phenomenon is described in this paper that came out recently, where they dub this a “verification ceiling”.
Luckily I stumbled upon this great dataset: Synthetic-2 from Prime Intellect. Despite its name, the test cases are actually sourced from coding competitions directly, not LM generated. This means they contain those long, stress-testy test cases I was looking for!
I transformed the data and launched a training run. It run trained for a few steps… then ran out of memory. Wait, what? Nothing about this new data should be stressing the GPU DRAM, unless it was inducing super long generations from my model (it wasn’t). The issue was it was actually overloading the CPU’s RAM. Some of these test cases were hundreds of MB’s long, and with a batch size of a few thousand we were asking fro TB’s of RAM to execute these concurrently :0. Luckily, I had already setup that cloud function for execution, so it was a simple drop-in replacement.
Great, fire up the run and… still not saturating LiveCodeBench! My coworker Hamish told me to try training directly on the eval to test the training setup and the limits of the base model. Sure enough, I couldn’t saturate LCB even then. RL and policy gradient algorithms have an annoying feature: they can’t really reinforce behaviors or tokens that a model never exhibits in the first place.2 The Deepseek paper refers to this as the cold start problem.
The way to get around this is to generate some targeted SFT data and train the model on that to seed the behavior you want to reinforce. Indeed, SFT’ing on QwQ traces helped immensely here. But there’s only so much a 7B model can do, and SFT’ing to seed behavior is a bit hacky IMO. I’m excited to try harder tasks on bigger models with less SFT and more entropy. I’m hopeful that will lead to even more interesting emergent behaviors during RL.
There’s actually a secret third thing. We ended up running a preference tuning algorithm called DPO for code. Preference tuning is usually meant to inject certain behaviors and personality into a model, but my coworkers Scott and Victoria found it can actually boost reasoning behavior if you sample chose and rejected completions from strong and weak reasoning models. It worked unreasonably well, and is more evidence in support of the Delta Learning Hypothesis which Scott et al describe in this paper
There’s like, a ton of papers that introduce methods to get around this. So many ideas to try, so little time :(


