Notes from systems I trained, shipped, broke, or had to explain with numbers.
200,000 PPO timesteps on an RTX 3050 cut crash rate from 98% to 3%, then exposed a slow-driving policy the reward function accidentally preferred.
6 min read