Autonomous Driving RL Agent
Training agents that make split-second decisions in dense traffic.
The Problem
Autonomous vehicles must make split-second driving decisions in dense traffic — when to accelerate, brake, or change lanes — while balancing competing objectives: go fast AND don't crash. How do you train an agent to navigate a 4-lane highway with 40+ vehicles without killing anyone?
The Approach
I trained PPO agents across two distinct driving environments (Highway-v0 and Intersection-v1) using Stable-Baselines3, with custom multi-objective reward functions and CUDA-accelerated training.
Key architectural decisions
PPO over DQN/A2C — PPO's clipped objective prevents catastrophic policy updates, critical for safety-critical driving. Industry standard (OpenAI Five, Tesla Autopilot research).
Custom multi-objective reward function (V6) — 6 iterations of reward design, balancing speed, safety distance, weaving penalty, slow-speed penalty, and collision penalty.
Two environments — Highway-v0 (dense 4-lane merging) and Intersection-v1 (cross-traffic with goal-directed navigation) to test generalization.
Mathematical analysis of failures — didn't just report results. When the agent learned a degenerate policy, I proved mathematically WHY it was optimal under the reward structure.
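The clipped objective mentioned above can be sketched as a plain scalar function (a minimal illustration of the PPO surrogate, not Stable-Baselines3's implementation; the clip range ε = 0.2 is SB3's default and is assumed here):

```python
def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate for one sample: take the minimum of the
    unclipped and clipped terms, so one update cannot move the policy
    far outside the [1 - eps, 1 + eps] trust region."""
    unclipped = ratio * advantage
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    return min(unclipped, clipped)

# A large probability ratio with positive advantage is capped at (1 + eps) * A:
print(ppo_clip_objective(ratio=1.5, advantage=1.0))  # 1.2
```

Taking the minimum is what prevents the catastrophic policy updates noted above: any step that would move the ratio past the clip boundary earns no extra objective.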
Technical Deep-Dive
Highway-v0 Environment
- Agent observes the 5 nearest vehicles (position, velocity) as a 5×5 kinematics matrix
- 5 discrete actions: lane left, idle, lane right, faster, slower
- 40+ vehicles in dense 4-lane traffic
- 200k training steps (~2.5 hours on an RTX 3050 with CUDA)
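The environment above maps to a highway-env configuration along these lines (a sketch: the key names follow highway-env's config dict, and any value not stated in this section is an assumption):

```python
# Highway-v0 configuration implied by the setup above. Key names follow
# highway-env's config dict; values not stated in the text are assumptions.
highway_config = {
    "observation": {
        "type": "Kinematics",
        "vehicles_count": 5,                             # 5 nearest vehicles
        "features": ["presence", "x", "y", "vx", "vy"],  # 5 features -> 5x5 matrix
    },
    "action": {"type": "DiscreteMetaAction"},  # LANE_LEFT, IDLE, LANE_RIGHT, FASTER, SLOWER
    "lanes_count": 4,                          # dense 4-lane traffic
    "vehicles_count": 40,                      # 40+ surrounding vehicles
}

# Training would then look roughly like (with highway-env and SB3 installed):
#   env = gymnasium.make("highway-v0", config=highway_config)
#   model = PPO("MlpPolicy", env, device="cuda")
#   model.learn(total_timesteps=200_000)
```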
Multi-objective reward function V6
- R_speed = v_ego / v_max (normalized velocity reward)
- R_safe_distance = +0.05 if front distance ≥ 15 m
- P_weaving = -0.08 if lane change within 10 steps of the previous one
- P_slow = -0.02 if velocity < 0.6 × v_max
- P_collision = -0.5 on crash
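Taken together, the V6 terms can be written as one function (a sketch; the environment is assumed to expose the crash flag, front distance, and steps since the last lane change):

```python
def reward_v6(v_ego: float, v_max: float, front_distance: float,
              crashed: bool, steps_since_lane_change: int) -> float:
    """Multi-objective reward V6: normalized speed plus the shaped
    bonuses and penalties listed above."""
    r = v_ego / v_max                        # R_speed
    if front_distance >= 15.0:
        r += 0.05                            # R_safe_distance
    if steps_since_lane_change < 10:
        r -= 0.08                            # P_weaving
    if v_ego < 0.6 * v_max:
        r -= 0.02                            # P_slow
    if crashed:
        r -= 0.5                             # P_collision
    return r
```

Cruising at 0.5 × v_max with a clear gap and no recent lane change earns 0.5 + 0.05 - 0.02 = 0.53 per step, which is the per-step figure behind the degenerate policy analyzed below.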
The degenerate policy discovery
The trained agent achieved 97% crash reduction — but by driving extremely slowly (SLOWER action 96.3% of the time). Through mathematical break-even analysis, I proved:
Slow driving at 0.5 × v_max earns about 0.55 reward/step (0.5 speed reward plus the 0.05 safe-distance bonus; the -0.02 slow-speed penalty, which applies below 0.6 × v_max, lowers this to 0.53), so a full 960-step episode returns roughly 509-528 total reward. Fast driving at v_max earns 1.0 reward/step but crashes far sooner, truncating the episode and adding the -0.5 collision penalty. The reward structure mathematically favors slow driving over fast driving.
This is a textbook reward exploitation problem — the agent found the optimal strategy under the given reward, but it wasn't the intended behavior.
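The break-even argument is short enough to check directly (the 960-step episode length comes from the analysis above; the crash step for the fast policy is an illustrative assumption, since the source only states that it crashes sooner):

```python
# Slow policy: 0.5 speed reward + 0.05 safe-distance bonus per step over a
# full 960-step episode (the -0.02 slow penalty only lowers this slightly).
slow_return = (0.5 + 0.05) * 960
print(round(slow_return))   # 528

# Fast policy: 1.0/step until a crash truncates the episode. The crash
# step is an ASSUMPTION for illustration, not a measured value.
crash_step = 300
fast_return = 1.0 * crash_step - 0.5
print(fast_return)          # 299.5

# The cumulative return favors slow driving by a wide margin.
assert slow_return > fast_return
```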
Intersection-v1 Environment
- 15 vehicle observations with heading information
- Goal-directed navigation (reach the far side of the intersection)
- Cross-traffic awareness required
- Discovered overfitting: the 100k checkpoint (100% success) outperformed the 200k checkpoint (2% success)
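The fix this finding points to is validation-based checkpoint selection; in sketch form (the success rates are the ones reported above, and in practice `checkpoint_success` would be filled by evaluation rollouts of each saved model, not hard-coded):

```python
# Measured success rates per Intersection-v1 checkpoint (from the results
# above); in practice these come from evaluation rollouts of each model.
checkpoint_success = {"100k": 1.00, "200k": 0.02}

# Select on validation success, not on training duration -- here the
# longer-trained checkpoint is dramatically worse.
best_checkpoint = max(checkpoint_success, key=checkpoint_success.get)
print(best_checkpoint)  # 100k
```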
Neural network architecture
- Shared MLP (128→128) with separate actor (5 actions) and critic (value) heads
- ~17,000 trainable parameters
- GAE (λ=0.95) for advantage estimation
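The GAE(λ) step can be sketched in a few lines (a standard textbook implementation, not the Stable-Baselines3 source; γ = 0.99 is SB3's default discount and is assumed here):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: a backwards pass accumulating
    exponentially weighted TD errors. `values` holds V(s_0)..V(s_T) with
    one extra bootstrap value beyond the last reward."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae                         # lambda-weighted sum
        advantages[t] = gae
    return advantages
```

With λ = 0 this collapses to one-step TD errors; λ = 0.95 trades a little bias for much lower variance.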
Challenges & Solutions
Challenge 1: Degenerate policy (reward exploitation)
Agent learned to drive slowly instead of fast-and-safe. Root cause: Collision penalty (-0.5) too weak relative to cumulative slow-driving rewards. Proposed fixes: 10× collision penalty, quadratic speed reward, distance-based reward.
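The proposed fixes can be sketched as a revised reward (`reward_v7_sketch` is a hypothetical name; the quadratic speed term and 10× collision penalty follow the fixes listed above, and the distance-based term is omitted for brevity):

```python
def reward_v7_sketch(v_ego: float, v_max: float, crashed: bool) -> float:
    """Hypothetical revision: a quadratic speed term makes slow driving
    disproportionately unrewarding, and a 10x collision penalty makes a
    crash clearly worse than a whole slow episode."""
    r = (v_ego / v_max) ** 2    # quadratic speed reward
    if crashed:
        r -= 5.0                # 10x the original -0.5 penalty
    return r

# At half speed the per-step reward drops from 0.5 to 0.25, pushing the
# break-even point back toward fast driving.
print(reward_v7_sketch(15.0, 30.0, False))  # 0.25
```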
Challenge 2: Overfitting in Intersection environment
100k checkpoint succeeded 100% of the time; 200k checkpoint only 2%. Extended training DEGRADED performance. Root cause: Entropy coefficient reduced too aggressively (0.02 → 0.003), eliminating exploration. Lesson: Entropy scheduling must be gradual.
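A gradual schedule could be as simple as a linear anneal between the two coefficients mentioned above (a sketch; the linear decay shape and the 200k-step horizon are assumptions):

```python
def entropy_coef(step: int, total_steps: int = 200_000,
                 start: float = 0.02, end: float = 0.003) -> float:
    """Linearly anneal the entropy coefficient so exploration fades out
    gradually instead of being cut off in one jump."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)
```

Halfway through training this gives 0.0115, rather than jumping straight from 0.02 to 0.003.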
Challenge 3: Lane change avoidance
Agent never changed lanes (0.0 lane changes per episode). Root cause: the weaving penalty (-0.08) was too aggressive, and staying in one lane while slowing down is also the mathematically safer strategy in dense traffic.
Lessons Learned
Reward function design is the hardest part of RL. Small imbalances create degenerate policies. Mathematical verification BEFORE training is essential — I should have done the break-even analysis before the first training run.
More training ≠ better performance. The Intersection environment proved that 100k steps outperformed 200k steps. Validation-based model selection (like in supervised learning) is critical in RL too.
Documenting failures is as valuable as documenting successes. The degenerate policy analysis is the most interesting part of this project — it shows understanding of RL dynamics, not just API usage.