Reinforcement learning, deterministic chaos, and scaling simulation
Bridging the gap from the digital to the physical
Chaos is everywhere. I’m not just being nihilistic here. Chaos is a well-studied mathematical phenomenon that can be described precisely. It might be the biggest barrier between AI systems and the real world. But it’s possible we can overcome it with more compute. If this pans out, simulation could be the next big axis to scale.
What is chaos?
Chaos is a property of a dynamic system, that is, a system that changes over time. A dynamic system is considered chaotic if it has a number of properties, which you can find listed on this Wikipedia page. For the purposes of this post there’s only one that matters, and it’s almost synonymous with what people think of when they think of chaotic systems.
I’m talking about sensitivity to initial conditions. A system exhibits this property if, for any two arbitrarily close starting points, and given enough time, their trajectories will diverge. This is colloquially known as “the butterfly effect”: small changes right now lead to totally different futures.
Here’s a neat thing about chaos: it can emerge in completely deterministic systems. One of the most basic examples is the double pendulum. Totally simple, totally deterministic, and totally chaotic. Below is an excellent video from one of my favorite profs showing sensitivity to initial conditions in the double pendulum system. It shows two double pendulums started at very slightly different initial positions, and sure enough, their trajectories diverge within 30 seconds.
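If you want to see that divergence numerically rather than on video, here’s a rough sketch of my own (not from the video): it integrates the standard point-mass, massless-rod double-pendulum equations of motion for two pendulums whose first angle differs by a billionth of a radian. The masses, lengths, and the size of the perturbation are arbitrary choices.

```python
# Rough sketch: two double pendulums whose initial angles differ by 1e-9 rad.
# Uses the standard point-mass / massless-rod equations of motion; all the
# constants here are arbitrary choices, not taken from the video.
import numpy as np
from scipy.integrate import solve_ivp

g = 9.81          # gravity (m/s^2)
m1 = m2 = 1.0     # bob masses (kg)
l1 = l2 = 1.0     # rod lengths (m)

def deriv(t, y):
    """State is (theta1, omega1, theta2, omega2); returns its time derivative."""
    th1, w1, th2, w2 = y
    d = th1 - th2
    den = 2 * m1 + m2 - m2 * np.cos(2 * d)
    dw1 = (-g * (2 * m1 + m2) * np.sin(th1)
           - m2 * g * np.sin(th1 - 2 * th2)
           - 2 * np.sin(d) * m2 * (w2**2 * l2 + w1**2 * l1 * np.cos(d))) / (l1 * den)
    dw2 = (2 * np.sin(d) * (w1**2 * l1 * (m1 + m2)
                            + g * (m1 + m2) * np.cos(th1)
                            + w2**2 * l2 * m2 * np.cos(d))) / (l2 * den)
    return [w1, dw1, w2, dw2]

# Two initial conditions that differ by one billionth of a radian in the first angle.
y0_a = [np.pi / 2, 0.0, np.pi / 2, 0.0]
y0_b = [np.pi / 2 + 1e-9, 0.0, np.pi / 2, 0.0]

t_eval = np.linspace(0, 30, 301)
sol_a = solve_ivp(deriv, (0, 30), y0_a, t_eval=t_eval, rtol=1e-10, atol=1e-10)
sol_b = solve_ivp(deriv, (0, 30), y0_b, t_eval=t_eval, rtol=1e-10, atol=1e-10)

# Watch the gap between the two trajectories grow over time.
for i in range(0, 301, 50):
    gap = abs(sol_a.y[0, i] - sol_b.y[0, i])
    print(f"t = {t_eval[i]:5.1f} s   |delta theta1| = {gap:.2e} rad")
```

Run it and the gap climbs by many orders of magnitude over the 30-second window, even though every step of the computation is perfectly deterministic.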
From simulation to reality
Let’s say you’re training an ML model to control a robot. You could, perhaps naively, run the model on the physical robot and let it learn through some environment you’ve built in the real world. But this is tedious, slow, requires physically rebuilding new environments, and is just not practical to run at the scale neural networks need to learn. Instead, you might simulate the real world and train your neural network entirely in simulation, then deploy it in the real world on your robot.
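To make the workflow concrete, here’s a minimal, purely illustrative sketch: the toy ReachTargetSim environment, its dynamics, and the random-search “training” are all hypothetical stand-ins, not a real robotics stack. The point is only the shape of the loop: learn against a cheap simulated environment, then run the frozen policy somewhere else.

```python
# Purely illustrative sketch of "train in simulation, deploy on hardware".
# ReachTargetSim and the random-search "training" are hypothetical stand-ins.
import random

class ReachTargetSim:
    """Toy 1-D simulation: an arm at position x tries to reach x = 1.0."""
    def reset(self):
        self.x = 0.0
        return self.x
    def step(self, action):
        self.x += action                      # simplistic dynamics
        reward = -abs(1.0 - self.x)           # closer to the target is better
        done = abs(1.0 - self.x) < 0.05
        return self.x, reward, done

def evaluate(env, policy, episodes=20, horizon=50):
    """Average return of a policy in a given environment."""
    total = 0.0
    for _ in range(episodes):
        obs = env.reset()
        for _ in range(horizon):
            obs, reward, done = env.step(policy(obs))
            total += reward
            if done:
                break
    return total / episodes

# "Training": pick the best constant step size by random search in simulation.
# A real setup would use PPO/SAC/etc., but the sim-then-deploy shape is the same.
best_policy, best_return = None, float("-inf")
for _ in range(200):
    step_size = random.uniform(-0.2, 0.2)
    policy = lambda obs, s=step_size: s
    ret = evaluate(ReachTargetSim(), policy, episodes=5)
    if ret > best_return:
        best_policy, best_return = policy, ret

print("return in simulation:", evaluate(ReachTargetSim(), best_policy))
# Deployment would swap ReachTargetSim for the physical robot's interface;
# any mismatch between the two is exactly the sim2real gap.
```

Deployment amounts to swapping ReachTargetSim for the physical robot’s interface, and everything that differs between the two is where the trouble in the next paragraph comes from.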
The sim2real, or “simulation to reality”, gap occurs when there’s an inaccuracy or missing component in your simulation. ML models can be surprisingly brittle in the face of these differences. This poses a huge challenge to anyone looking to train a model in a simulated environment and then deploy it in the real world. My coworker Finbarr was telling me about how some researchers at DeepMind were struggling because their simulation wasn’t accounting for minuscule wear in some of the components of their robot. That small difference totally threw off the model when they deployed it on a real-life robot, making it more or less useless in the real world.
Now, imagine instead of simulating a robot arm you’re trying to simulate a chaotic system. For example, you might be training a model to predict weather patterns, many of which are known to exhibit chaos. We need to be precise when simulating these systems; any small inaccuracy will totally throw off our simulation compared to what would happen in the real world. But precision in simulation comes at the cost of compute. We need more memory and FLOPs to track and compute the increasingly complex moving parts of our simulation. Worse yet, there’s an exponential blowup here: because nearby trajectories diverge exponentially, extending the horizon over which the simulation stays faithful by a fixed amount requires a multiplicative increase in precision, and therefore in compute.
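One standard back-of-the-envelope way to see the blowup (the symbols here are generic, not tied to any particular weather model): if the system has a positive Lyapunov exponent \(\lambda\), an initial error \(\delta_0\) grows roughly exponentially, and the horizon \(T\) over which predictions stay within a tolerance \(\varepsilon\) grows only logarithmically as you shrink \(\delta_0\):

\[
\delta(t) \;\approx\; \delta_0 \, e^{\lambda t},
\qquad
T \;\approx\; \frac{1}{\lambda}\,\ln\frac{\varepsilon}{\delta_0}.
\]

Halving \(\delta_0\), i.e. carrying one more accurate bit through every state variable of the simulation, buys only a fixed \(\ln 2 / \lambda\) of extra horizon. Pushing the horizon out linearly therefore means shrinking \(\delta_0\) exponentially, and the compute needed to achieve that precision grows with it.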
Hacking the simulation
Reinforcement learning (RL) for language models seems extremely promising. It allows pre-trained models, which have extremely strong priors, to explore an environment and discover new strategies to solve problems. For any problem where we can accurately simulate the environment and measure success, we can, at least in theory, train models through RL.1
But, just like in any other RL setup, language models are prone to reward hacking. This is when a model figures out a way to earn the reward in a way we don’t want it to. There’s this really famous example from OpenAI where they were trying to train a model to play a racing game called CoastRunners. While racing is the primary objective, you can earn small bonus points by crashing into stuff. Instead of trying to get 1st place, the model discovers this hacky loop to earn infinite points:
If you want an example from the language model world, here’s a real screenshot from one of my training runs. I had another language model judging the output of my model to assign a reward. My model learns to “hack” the judge by telling it to award a 10/10 instead of just providing a good response to the user query.
But, for certain domains, we can construct “un-hackable” reward functions. When you train on these verifiable domains, language models seem to learn tasks in more robust and non-hacky ways. Tulu 3 is the first place I’ve seen this idea published, and they dub it ‘Reinforcement Learning from Verifiable Rewards’, or RLVR. Examples of verifiable domains include math and code, where reward is given only if the model produces the right answer to a math problem or passes the test cases for a coding problem.
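As a concrete (if simplified) sketch of what ‘verifiable’ means here, the rewards below are my own illustration, not Tulu 3’s actual implementation: the math reward fires only on an exact match with a reference answer, and the code reward fires only if the model’s program passes every test case.

```python
# Illustrative sketch of verifiable rewards, not the Tulu 3 implementation.
# Real systems run untrusted model code in a sandbox; exec() here is only
# to keep the example short.

def math_reward(model_answer: str, reference_answer: str) -> float:
    """1.0 only if the model's final answer exactly matches the reference."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(model_program: str, test_cases: list[tuple[str, object]]) -> float:
    """1.0 only if the model's function passes every (input, expected) test."""
    namespace: dict = {}
    try:
        exec(model_program, namespace)        # defines `solution` (we hope)
        solution = namespace["solution"]
        for inp, expected in test_cases:
            if solution(inp) != expected:
                return 0.0
        return 1.0
    except Exception:
        return 0.0

# Example: the model was asked for a function that reverses a string.
program = "def solution(s):\n    return s[::-1]"
print(code_reward(program, [("abc", "cba"), ("", "")]))   # 1.0
print(math_reward("42", "42"))                            # 1.0
```

The key property is that there’s nothing for the policy to sweet-talk: the reward comes from an exact check, not from another model’s judgment.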
This direction seems really promising, and there’s a lot of work being done right now to construct more ‘verifiable’ environments. Reward hacking is a fundamental problem in reinforcement learning, but with clever construction of environments and verifiable rewards, we can get around it.
Learning through the faulty chaos
A quick summary of each section so far:
The real world is inherently chaotic. With long enough horizons, any two arbitrarily close realities can have very different futures.
This means that any small inaccuracy in a simulation can lead to huge errors given enough time. We can increase precision by increasing the compute we spend on the simulation.
Reinforcement learning seems extremely promising, but even language models are prone to reward hacking. Engineering accurate and verifiable environments seems to help a lot.
Because it allows for exploration, RL is the only learning paradigm we’ve come up with that can discover new knowledge. Many researchers are excited by the possibility of AI systems, trained with RL, discovering solutions to our hardest problems like new materials for batteries, long-horizon weather prediction, or even a cure for cancer.
But if we want to train AI systems that can solve these problems and operate in the real world, we need to scale the amount of compute we spend on simulation. Any system trained on a faulty simulation will completely fail in the real world, given enough time.
The scaling laws aren’t great. Exponential compute for linear gains feels like an insurmountable wall. But we saw the same kind of scaling pattern with pre-training compute and inference-time compute for language models, and we scaled them anyway. And it worked!
I tweeted about this a few weeks ago. It’s not a new or original idea, but I’m becoming more and more confident that this will be the next big axis we scale. Like with everything else, progress here will be pretty slow. The real world is so, so much more complicated than the digital world. But I’m really excited to see how this all turns out.
Well, there’s a hidden third requirement: sufficiently dense rewards. Even if we have accurate simulated environments and measurements of success, a model will never learn if it can’t reach the goal through a reasonable amount of exploration. Not super relevant to this post, but definitely worth mentioning. Also, the “at least in theory” bit is doing a lot of heavy lifting here. RL is really, really hard to set up in a way that leads to “good stuff”. LMs are a bit easier since the initial model is so strong.


