Jan 2026 · 6 min read
Degenerate Policy in a Highway RL Agent
200,000 PPO timesteps on an RTX 3050 cut crash rate from 98% to 3%, then exposed a slow-driving policy the reward function accidentally preferred.
200,000 timesteps
Jan 2026: I trained PPO agents in highway-env on an NVIDIA GeForce RTX 3050 Laptop GPU for 200,000 timesteps in about 2.5 hours.
The headline metrics looked good: crash rate fell from 98% to 3% (a 97% relative reduction, or 95 percentage points), and mean reward rose from -54.2 to +329.8.
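The two ways of quoting the crash-rate change are easy to mix up, so here is the arithmetic as a small check. It uses only the figures quoted above; the function name is mine.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original value that was removed."""
    return (before - after) / before


crash_before, crash_after = 0.98, 0.03

# Relative reduction: how much of the original crash rate disappeared.
rel = relative_reduction(crash_before, crash_after)

# Absolute drop, in percentage points.
points = (crash_before - crash_after) * 100

# Change in mean reward between the two quoted values.
reward_gain = 329.8 - (-54.2)

print(f"relative reduction: {rel:.1%}")
print(f"absolute drop: {points:.0f} percentage points")
print(f"mean reward gain: {reward_gain:.1f}")
```

The relative figure comes out just under 97%, which is where the rounded headline number comes from.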
100k checkpoint
The best checkpoint was not the final one. Mean reward peaked at 329.84 +/- 33.06 at 100k timesteps; the second 100k timesteps did not improve on it and instead pushed the policy further into exploiting the shape of the reward.
The action distribution showed the failure: SLOWER was chosen 454.7 times per episode on average, 96.3% of all actions, while FASTER and the lane-change actions were never chosen.
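A minimal sketch of how per-episode counts and shares like these can be computed from logged actions. The action-id mapping below is an assumption based on highway-env's default discrete meta-actions; check it against your own environment config. The toy episodes are made up for illustration.

```python
from collections import Counter

# Assumed mapping (highway-env default discrete meta-actions); verify locally.
ACTION_NAMES = {0: "LANE_LEFT", 1: "IDLE", 2: "LANE_RIGHT", 3: "FASTER", 4: "SLOWER"}


def action_distribution(episodes):
    """episodes: list of lists of int action ids.

    Returns {action_name: (mean_count_per_episode, share_of_all_actions)}.
    """
    counts = Counter()
    for episode in episodes:
        counts.update(episode)
    total = sum(counts.values())
    n_episodes = len(episodes)
    if total == 0 or n_episodes == 0:
        return {name: (0.0, 0.0) for name in ACTION_NAMES.values()}
    return {
        name: (counts[a] / n_episodes, counts[a] / total)
        for a, name in ACTION_NAMES.items()
    }


# Toy data: two 10-step episodes that are almost entirely SLOWER.
toy_episodes = [[4] * 9 + [1], [4] * 10]
dist = action_distribution(toy_episodes)
for name, (per_episode, share) in dist.items():
    print(f"{name:>10}: {per_episode:6.1f} per episode, {share:6.1%}")

# If the 96.3% share is taken over all actions, the quoted figures
# imply this many actions per episode on average.
implied_steps = 454.7 / 0.963
```

Under that reading, the quoted numbers imply roughly 472 actions per episode, with whatever is left over going to the one remaining action.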
0.5 collision penalty
The reward function made slow driving rational. The speed reward R_speed was bounded in [0, 1], while the collision penalty P_collision was 0.5, so the incentive to avoid collisions outweighed any incentive to make useful progress.
The agent found a locally optimal policy: brake constantly, avoid traffic, collect survival reward, and ignore the driving behavior I actually wanted.
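To make the accounting concrete, here is a toy comparison of two episodes. This is not the environment's actual reward code. It assumes the return is a sum of per-step speed reward, that a crash subtracts the 0.5 penalty once and ends the episode, and every number other than 0.5 is hypothetical.

```python
COLLISION_PENALTY = 0.5  # the value quoted in the post


def episode_return(speed_reward_per_step, steps_survived, crashed,
                   collision_penalty=COLLISION_PENALTY):
    """Sum of per-step speed reward, minus a one-off penalty on a crash."""
    total = speed_reward_per_step * steps_survived
    if crashed:
        total -= collision_penalty
    return total


# Hypothetical slow driver: low speed reward, survives a long episode.
slow_return = episode_return(0.2, 470, crashed=False)

# Hypothetical fast driver: maximum speed reward, crashes after 40 steps.
fast_return = episode_return(1.0, 40, crashed=True)

# Share of the fast driver's earned reward that the penalty takes away.
penalty_share = COLLISION_PENALTY / (1.0 * 40)

print(f"slow: {slow_return:.1f}  fast: {fast_return:.1f}")
print(f"penalty as share of fast driver's reward: {penalty_share:.2%}")
```

In this toy setup the penalty itself barely registers. What costs the fast driver is the steps it never gets to collect, which is the same pressure that rewards braking and surviving.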
17,000 parameters
The model had about 17,000 trainable parameters. The bug was not model capacity; it was reward accounting.
The fix is reward design: stronger collision penalty, non-linear speed reward, distance-based progress reward, and a penalty for crawling below the target speed.
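One way the four changes could fit together is sketched below. Every constant and parameter name here is a placeholder I picked for illustration, not a tuned value and not something from the original run; the function would still need to be wired into the environment's reward hook.

```python
def shaped_reward(speed, distance_delta, crashed, *,
                  target_speed=25.0, max_speed=30.0,
                  collision_penalty=5.0, crawl_penalty=0.5,
                  progress_weight=0.05):
    """Per-step reward sketch. Speeds in m/s, distance_delta in metres.

    All default values are hypothetical placeholders.
    """
    if crashed:
        # Stronger collision penalty: ten times the original 0.5.
        return -collision_penalty

    # Non-linear speed reward: squaring makes low speeds earn very little.
    speed_frac = min(max(speed / max_speed, 0.0), 1.0)
    speed_term = speed_frac ** 2

    # Distance-based progress reward for forward movement this step.
    progress_term = progress_weight * max(distance_delta, 0.0)

    # Crawl penalty: grows linearly as speed falls below the target.
    crawl_term = crawl_penalty * max(target_speed - speed, 0.0) / target_speed

    return speed_term + progress_term - crawl_term


cruising = shaped_reward(30.0, 30.0, crashed=False)
crawling = shaped_reward(5.0, 5.0, crashed=False)
stopped = shaped_reward(0.0, 0.0, crashed=False)
crash = shaped_reward(30.0, 30.0, crashed=True)
print(cruising, crawling, stopped, crash)
```

With these placeholder values, crawling and standing still both earn negative reward per step, so surviving slowly no longer accumulates return on its own.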
What I would do differently
I would inspect action distributions every 10,000 timesteps instead of waiting for aggregate reward curves to look wrong.
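A library-agnostic sketch of that check: a small monitor that is fed every action taken during training and records a report each time a fixed number of timesteps has passed. The class name, the 0.9 collapse threshold, and the report format are all my own choices. Hooking it into a specific training loop or callback API is left out.

```python
from collections import Counter


class ActionMonitor:
    """Tracks action shares over fixed-size windows of timesteps."""

    def __init__(self, every=10_000, collapse_threshold=0.9):
        self.every = every
        self.collapse_threshold = collapse_threshold  # hypothetical cutoff
        self.window = Counter()
        self.steps = 0
        self.reports = []

    def record(self, action):
        """Call once per environment step with the chosen action id."""
        self.window[action] += 1
        self.steps += 1
        if self.steps % self.every == 0:
            self._flush()

    def _flush(self):
        total = sum(self.window.values())
        shares = {a: c / total for a, c in self.window.items()}
        top_action, top_share = max(shares.items(), key=lambda kv: kv[1])
        self.reports.append({
            "timestep": self.steps,
            "shares": shares,
            "top_action": top_action,
            "collapsed": top_share >= self.collapse_threshold,
        })
        self.window.clear()  # start a fresh window


# Toy run: mixed actions for 100 steps, then only action 4.
monitor = ActionMonitor(every=100)
for t in range(200):
    monitor.record(4 if (t >= 100 or t % 2) else 1)

for report in monitor.reports:
    print(report["timestep"], report["top_action"], report["collapsed"])
```

Windowed shares matter here: a cumulative histogram would hide a late collapse behind the early exploratory steps.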
I would also run a hand-written policy baseline, because a simple brake-heavy controller would have exposed the reward loophole before the PPO run finished.
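A sketch of such a baseline, assuming a Gymnasium-style environment where `reset()` returns `(obs, info)` and `step()` returns a five-tuple. The `SLOWER` index and the `"crashed"` info key are assumptions about highway-env's defaults. `StubEnv` is a hypothetical stand-in so the snippet runs on its own; swap in the real environment to get meaningful numbers.

```python
SLOWER = 4  # assumed action id; verify against the environment's action space
FASTER = 3  # assumed action id


def brake_heavy_policy(obs):
    """Hand-written baseline: always brake, ignore the observation."""
    return SLOWER


def evaluate(env, policy, episodes=5):
    """Returns (mean episode return, crash rate) for a fixed policy."""
    returns, crashes = [], 0
    for _ in range(episodes):
        obs, info = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, info = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        crashes += bool(info.get("crashed", False))
        returns.append(total)
    return sum(returns) / len(returns), crashes / episodes


class StubEnv:
    """Hypothetical toy env: any action except SLOWER crashes at step 10."""

    def __init__(self, horizon=40):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return None, {}

    def step(self, action):
        self.t += 1
        crashed = action != SLOWER and self.t >= 10
        reward = -0.5 if crashed else 0.5
        truncated = self.t >= self.horizon
        return None, reward, crashed, truncated, {"crashed": crashed}


brake_return, brake_crash_rate = evaluate(StubEnv(), brake_heavy_policy)
fast_return, fast_crash_rate = evaluate(StubEnv(), lambda obs: FASTER)
print(brake_return, brake_crash_rate, fast_return, fast_crash_rate)
```

If the always-brake baseline scores close to the trained agent on the real environment, the reward is paying for survival rather than driving, and that is visible before any training budget is spent.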