Pose Estimation Research
Systematic ML model benchmarking for real-time exercise tracking.
The Problem
People doing home workouts have no reliable way to count reps automatically or know if their form is degrading over a set. Existing solutions either require expensive hardware (motion capture) or use a single pose estimation model without evaluating alternatives. Which model is actually better for real-time exercise tracking — and how do you measure "better"?
The Approach
I conducted a comparative analysis of two leading pose estimation models under identical conditions, then built a complete rep counting and form scoring pipeline on top of each.
Key architectural decisions
MediaPipe vs YOLOv8-Pose — chose these because they represent two fundamentally different approaches: MediaPipe (33 keypoints, real-time optimized, lightweight) vs YOLOv8-Pose (17 keypoints, object-detection-based, more robust to occlusion).
Signal processing for rep counting — rather than brittle threshold-based counting, used Savitzky-Golay filtering plus scipy peak detection on joint angle time series. This is robust to noise and handles different exercise speeds.
Dual-model evaluation under identical conditions — same video input, same exercises, same metrics. No cherry-picking results.
Technical Deep-Dive
Pose estimation pipeline
1. Video input → frame extraction at native FPS
2. Per-frame pose estimation (MediaPipe 33 keypoints OR YOLOv8-Pose 17 keypoints)
3. Joint angle computation from keypoint coordinates (e.g., elbow angle from shoulder-elbow-wrist)
4. Time series construction of joint angles across all frames
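The angle-computation step (step 3) can be sketched with a small vector-geometry helper. The function name and the plain `(x, y)` input format are illustrative, not the project's actual API:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at keypoint b (in degrees), formed by segments b->a and b->c.

    a, b, c are (x, y) keypoint coordinates -- e.g. shoulder, elbow, wrist
    gives the elbow angle.
    """
    a, b, c = np.asarray(a, float), np.asarray(b, float), np.asarray(c, float)
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip guards against floating-point drift just outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```

Running this per frame over the video produces the joint-angle time series that the rep counter consumes.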
Rep counting algorithm
1. Raw joint angle signal → Savitzky-Golay smoothing filter (removes noise while preserving peaks)
2. Scipy find_peaks on the smoothed signal with configurable prominence and distance thresholds
3. Each detected peak = one rep
4. Per-rep form scoring: compare each rep's angle range against the first rep (assumed good form). Deviation = degradation score.
Benchmarking
- Latency: time per frame for each model
- Accuracy: keypoint confidence scores and rep count correctness
- Model size: memory footprint and download size
- All measured on the same hardware, same input videos
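The latency measurement can be sketched as a simple timing loop. `infer_fn` is a hypothetical wrapper (frame in, keypoints out) standing in for either model's inference call; the warm-up count is an assumed default:

```python
import time

def benchmark_latency(infer_fn, frames, warmup=5):
    """Mean per-frame inference latency in milliseconds.

    Warm-up frames are run but excluded, so one-off initialization cost
    (model load, graph compilation) doesn't skew the average.
    """
    for f in frames[:warmup]:
        infer_fn(f)
    times = []
    for f in frames[warmup:]:
        t0 = time.perf_counter()
        infer_fn(f)
        times.append((time.perf_counter() - t0) * 1000)
    return sum(times) / len(times)
```

Running the same loop with both models on the same frame list keeps the comparison apples-to-apples.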
Challenges & Solutions
Challenge 1: Noisy joint angle signals
Raw keypoint coordinates jitter frame-to-frame, producing noisy angle signals that trigger false peaks. Fix: Savitzky-Golay filter with carefully tuned window size smooths noise while preserving the actual rep peaks.
Challenge 2: Different keypoint sets
MediaPipe gives 33 keypoints, YOLOv8-Pose gives 17. Comparing them directly isn't straightforward. Fix: Mapped both to a common subset of joints (shoulders, elbows, wrists, hips, knees, ankles) for fair comparison, while noting that MediaPipe's additional keypoints (face, hands) provide richer data.
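The mapping can be sketched as two index tables onto a shared joint list. The indices below follow MediaPipe Pose's documented 33-landmark layout and the COCO-17 layout used by YOLOv8-Pose; the dict names and helper are illustrative:

```python
# Shared joint subset used for the fair comparison
COMMON_JOINTS = ["l_shoulder", "r_shoulder", "l_elbow", "r_elbow",
                 "l_wrist", "r_wrist", "l_hip", "r_hip",
                 "l_knee", "r_knee", "l_ankle", "r_ankle"]

# MediaPipe Pose landmark indices (33-point layout)
MEDIAPIPE_IDX = {"l_shoulder": 11, "r_shoulder": 12, "l_elbow": 13,
                 "r_elbow": 14, "l_wrist": 15, "r_wrist": 16,
                 "l_hip": 23, "r_hip": 24, "l_knee": 25, "r_knee": 26,
                 "l_ankle": 27, "r_ankle": 28}

# COCO-17 keypoint indices (YOLOv8-Pose)
YOLO_IDX = {"l_shoulder": 5, "r_shoulder": 6, "l_elbow": 7, "r_elbow": 8,
            "l_wrist": 9, "r_wrist": 10, "l_hip": 11, "r_hip": 12,
            "l_knee": 13, "r_knee": 14, "l_ankle": 15, "r_ankle": 16}

def to_common(keypoints, idx_map):
    """Project a model's full keypoint array onto the shared joint set."""
    return {name: keypoints[i] for name, i in idx_map.items()}
```

After projection, both models feed identical joint names into the same angle and rep-counting code.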
Challenge 3: Form degradation quantification
"Good form" vs "bad form" is subjective. Fix: Used the first rep as the baseline and measured deviation in joint angle range for subsequent reps. Increasing deviation over a set = measurable form degradation.
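The first-rep-as-baseline scoring can be sketched as follows; the function name and the fractional-deviation scale are illustrative choices, not the project's exact scoring:

```python
def form_scores(rep_ranges):
    """Per-rep degradation relative to the first rep's angle range.

    rep_ranges: list of (min_angle, max_angle) per detected rep.
    Returns each rep's fractional deviation in range of motion from rep 1:
    0.0 = identical ROM, 0.25 = 25% shallower (or deeper) than the baseline.
    """
    baseline = rep_ranges[0][1] - rep_ranges[0][0]
    return [abs((hi - lo) - baseline) / baseline for lo, hi in rep_ranges]
```

A score sequence that climbs across a set is the measurable signal of form degradation.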
Lessons Learned
Pretrained models are the right choice for inference tasks. Training a pose estimation model from scratch would take weeks and wouldn't beat MediaPipe or YOLOv8. Use pretrained models, evaluate them without vendor bias, and focus engineering effort on the application layer.
Signal processing is underrated in ML pipelines. The rep counting accuracy depends more on the filtering and peak detection than on the pose estimation model itself.
Fair benchmarking requires identical conditions. Same video, same hardware, same metrics. Without this discipline, model comparisons are meaningless.