
Jan 2026 · 6 min read

Degenerate Policy in a Highway RL Agent

200,000 PPO timesteps on an RTX 3050 cut crash rate from 98% to 3%, then exposed a slow-driving policy the reward function accidentally preferred.

200,000 timesteps

Jan 2026: I trained PPO agents in highway-env on an NVIDIA GeForce RTX 3050 Laptop GPU for 200,000 timesteps in about 2.5 hours.

The headline metrics looked good: crash rate moved from 98% to 3%, a 97% relative reduction, and mean reward moved from -54.2 to a peak of +329.8.

100k checkpoint

The best checkpoint was not the final one. At 100k timesteps, mean reward reached 329.84 ± 33.06; by 200k timesteps, the policy had overfit the reward shape.

The action distribution showed the failure: SLOWER appeared 454.7 times per episode on average, or 96.3% of all actions; FASTER and the lane-change actions never appeared.
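A minimal sketch of how per-episode action counts and shares like these can be computed from logged actions. The action names match highway-env's discrete meta-actions; the `action_shares` helper and the logged episodes are hypothetical and are not data from this run.

```python
from collections import Counter

# Action names follow highway-env's discrete meta-actions.
ACTION_NAMES = ("LANE_LEFT", "IDLE", "LANE_RIGHT", "FASTER", "SLOWER")


def action_shares(episodes):
    """Return mean count per episode and overall share for each action.

    `episodes` is a list of episodes, each a list of action-name strings.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    counts = Counter()
    for actions in episodes:
        counts.update(actions)
    total = sum(counts.values())
    return {
        name: {
            "per_episode": counts[name] / len(episodes),
            "share": counts[name] / total if total else 0.0,
        }
        for name in ACTION_NAMES
    }


# Hypothetical logged episodes, not data from the run described above.
logged = [
    ["SLOWER"] * 9 + ["IDLE"],
    ["SLOWER"] * 10,
]
stats = action_shares(logged)
for name, row in stats.items():
    print(f"{name:<10} {row['per_episode']:6.1f} per episode  {row['share']:6.1%}")
```

A table like this makes a collapsed policy obvious at a glance, even while the aggregate reward still looks healthy.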

0.5 collision penalty

The reward function made slow driving rational. R_speed lived in [0, 1], while P_collision was 0.5, so avoiding collision dominated useful progress.

The agent found a locally optimal policy: brake constantly, avoid traffic, collect survival reward, and ignore the driving behavior I actually wanted.
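The accounting can be sketched with a toy return calculation. Only the [0, 1] speed range and the 0.5 collision penalty come from the run. The per-step survival reward, the one-off penalty on crashing, the episode lengths, and the `episode_return` helper are assumptions chosen for illustration.

```python
def episode_return(speed_reward, steps_survived, crashed,
                   survival_reward=0.5, collision_penalty=0.5):
    """Sum per-step reward over an episode, minus a one-off crash penalty.

    survival_reward is a hypothetical per-step bonus; collision_penalty
    defaults to the 0.5 used in the run.
    """
    if not 0.0 <= speed_reward <= 1.0:
        raise ValueError("speed_reward must lie in [0, 1]")
    total = steps_survived * (survival_reward + speed_reward)
    if crashed:
        total -= collision_penalty
    return total


# Hypothetical episodes: fast driving that crashes at step 20
# versus crawling that survives a 100-step horizon.
fast = episode_return(speed_reward=1.0, steps_survived=20, crashed=True)
slow = episode_return(speed_reward=0.0, steps_survived=100, crashed=False)
print(f"fast and crashing:  {fast:.1f}")
print(f"slow and surviving: {slow:.1f}")
```

Under these assumptions the crawling episode collects more total reward, because every step lost to a crash forfeits more than the bounded speed term can earn back.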

17,000 parameters

The model had about 17,000 trainable parameters. The bug was not model capacity; it was reward accounting.

The fix is reward design: a stronger collision penalty, a non-linear speed reward, a distance-based progress reward, and a penalty for crawling below the target speed.
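One way those four terms could fit together, as a sketch rather than the reward I actually shipped. The `shaped_reward` function, every coefficient, and the speed values in m/s are hypothetical and would need tuning against the environment.

```python
def shaped_reward(speed, distance_delta, crashed,
                  target_speed=25.0, max_speed=30.0,
                  collision_penalty=5.0, progress_weight=0.1,
                  crawl_penalty=0.5):
    """Hypothetical per-step reward; all defaults are illustrative."""
    if crashed:
        # Stronger collision penalty: large relative to any single step.
        return -collision_penalty
    # Non-linear speed term: squaring rewards high speeds far more than crawling.
    speed_frac = min(max(speed / max_speed, 0.0), 1.0)
    reward = speed_frac ** 2
    # Distance-based progress: metres travelled during this step.
    reward += progress_weight * max(distance_delta, 0.0)
    # Crawl penalty: grows linearly the further speed falls below target.
    if speed < target_speed:
        reward -= crawl_penalty * (target_speed - speed) / target_speed
    return reward


cruising = shaped_reward(speed=25.0, distance_delta=25.0, crashed=False)
crawling = shaped_reward(speed=5.0, distance_delta=5.0, crashed=False)
crash = shaped_reward(speed=25.0, distance_delta=25.0, crashed=True)
print(f"cruising {cruising:.3f}, crawling {crawling:.3f}, crash {crash:.3f}")
```

The intended ordering is cruising above crawling above crashing, so braking no longer earns close to what driving at the target speed does.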

What I would do differently

I would inspect action distributions every 10,000 timesteps instead of waiting for aggregate reward curves to look wrong.
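A library-agnostic sketch of that check. The `ActionMonitor` class and its 0.9 collapse threshold are hypothetical; in a real training loop, `record` would be called from whatever per-step hook the RL library exposes.

```python
from collections import Counter


class ActionMonitor:
    """Report the action mix once per window and flag a dominant action."""

    def __init__(self, window=10_000, collapse_threshold=0.9):
        self.window = window
        self.collapse_threshold = collapse_threshold
        self.counts = Counter()
        self.steps = 0
        self.reports = []  # (step, top_action, top_share, collapsed)

    def record(self, action):
        """Call once per environment step with the chosen action."""
        self.counts[action] += 1
        self.steps += 1
        if self.steps % self.window == 0:
            self._report()

    def _report(self):
        total = sum(self.counts.values())
        shares = {a: n / total for a, n in self.counts.items()}
        top_action, top_share = max(shares.items(), key=lambda kv: kv[1])
        collapsed = top_share >= self.collapse_threshold
        self.reports.append((self.steps, top_action, top_share, collapsed))
        if collapsed:
            print(f"step {self.steps}: {top_action} is {top_share:.1%} of actions")
        self.counts.clear()  # start a fresh window


# Toy usage with a 10-step window instead of 10,000.
monitor = ActionMonitor(window=10, collapse_threshold=0.9)
for action in ["FASTER", "IDLE"] * 5 + ["SLOWER"] * 10:
    monitor.record(action)
```

Resetting the counts each window keeps an early, healthy action mix from masking a later collapse.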

I would also run a hand-written policy baseline, because a simple brake-heavy controller would have exposed the reward loophole before the PPO run finished.
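Such a baseline can be very small. This sketch assumes the Gymnasium step API and highway-env's default discrete meta-actions, where SLOWER is index 4 and `info` carries a `crashed` flag. `ToyEnv` is a hypothetical stand-in so the snippet runs without the simulator.

```python
SLOWER = 4  # index of SLOWER in highway-env's default DiscreteMetaAction set


def brake_heavy_policy(observation):
    """Ignore the observation and always brake."""
    return SLOWER


def evaluate(env, policy, episodes=10):
    """Mean return and crash rate over `episodes`, using the Gymnasium step API."""
    returns, crashes = [], 0
    for _ in range(episodes):
        obs, info = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, info = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
        crashes += bool(info.get("crashed", False))
    return sum(returns) / len(returns), crashes / episodes


class ToyEnv:
    """Hypothetical stand-in: fixed horizon, reward 1.0 per step, never crashes."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return None, {}

    def step(self, action):
        self.t += 1
        truncated = self.t >= self.horizon
        return None, 1.0, False, truncated, {"crashed": False}


mean_return, crash_rate = evaluate(ToyEnv(), brake_heavy_policy, episodes=3)
print(f"mean return {mean_return:.1f}, crash rate {crash_rate:.0%}")
```

Swapping `ToyEnv` for the real highway-env environment would give the number to beat: if always braking scores close to the trained agent, the reward function is the problem, not the policy.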
