This project started as a fairly standard object detection task and turned into a deep lesson in why severity classification is fundamentally harder than binary detection — and how synthetic data can both help and mislead you.
The dataset problem
There is no good public pothole severity dataset, so we built one: 1,982 images, 5,021 labeled potholes, two severity classes (major and medium). Annotation work is unglamorous, but it taught me that label quality directly determines what the model can and cannot learn. Defining "major" vs "medium" consistently across nearly 2,000 images requires more upfront alignment between annotators than you'd expect — and ambiguous class definitions show up immediately in your confusion matrices.
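One way to make "upfront alignment" concrete is to have two annotators label the same subset and measure agreement before labeling at scale. A minimal sketch of Cohen's kappa for that check — the labels below are hypothetical, not from our dataset:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on categorical labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical severity calls from two annotators on the same six potholes.
a = ["major", "medium", "medium", "major", "medium", "major"]
b = ["major", "medium", "major", "major", "medium", "medium"]
print(round(cohens_kappa(a, b), 3))  # → 0.333
```

A low kappa on a pilot batch is a signal to tighten the class definitions before annotating the rest.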
Why two stages?
Single-stage severity-aware detection underperformed on medium-severity cases. The two-stage approach — YOLOv11 for detection, YOLOv8-cls for severity classification — let each model specialize. Medium-severity recall went from 0.62 to 0.78 after the split.
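The pipeline shape is simple: the detector proposes boxes, and the classifier assigns a severity to each crop. A minimal sketch of that wiring — `detect` and `classify` here are hypothetical stand-ins for the YOLOv11 detector and YOLOv8-cls model, not their real APIs:

```python
def crop(image, box):
    """Crop a row-major image, given box = (x1, y1, x2, y2) in pixel coords."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def two_stage(image, detect, classify):
    """Stage 1: detect potholes. Stage 2: classify the severity of each crop.

    detect(image)  -> iterable of (box, detection_confidence)
    classify(crop) -> (severity_label, classification_confidence)
    """
    results = []
    for box, det_conf in detect(image):
        severity, cls_conf = classify(crop(image, box))
        results.append({
            "box": box,
            "severity": severity,
            # Combined score; multiplying the two confidences is one
            # simple choice, not necessarily what the project used.
            "confidence": det_conf * cls_conf,
        })
    return results
```

Keeping the stages behind plain functions like this also makes it easy to swap either model independently, which is most of the point of the split.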
Synthetic data via Blender
We generated synthetic labeled imagery in Blender to test depth-estimation limitations for small potholes. The honest finding: synthetic imagery helped with volume but introduced distribution shift. Models trained on too high a synthetic fraction performed worse on real-world rainy and low-visibility conditions.
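The practical mitigation is to cap the synthetic share of each training epoch rather than dumping everything in. A sketch of one way to do that — the function name and the specific cap are illustrative, not the project's actual training code:

```python
import random

def mix_datasets(real_paths, synthetic_paths, synthetic_fraction, seed=0):
    """Build a training list that keeps every real image and caps the
    synthetic share at `synthetic_fraction` of the final mix."""
    rng = random.Random(seed)
    n_real = len(real_paths)
    # Solve n_syn / (n_real + n_syn) = synthetic_fraction for n_syn.
    n_syn = min(
        len(synthetic_paths),
        int(n_real * synthetic_fraction / (1 - synthetic_fraction)),
    )
    mixed = list(real_paths) + rng.sample(synthetic_paths, n_syn)
    rng.shuffle(mixed)
    return mixed
```

Sweeping `synthetic_fraction` while evaluating only on held-out real imagery (including the rainy and low-visibility slices) is how you find the point where added volume stops paying for the distribution shift.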
Final numbers
Binary detection: 0.941 precision, 0.894 recall. Severity-aware: up to 0.737 precision and 0.670 mAP@50. The gap between binary and severity performance is the real story — severity is a much harder problem.
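For readers who want the definitions behind those numbers: precision and recall come straight from true-positive, false-positive, and false-negative counts. A minimal sketch — the counts in the example are hypothetical values chosen to reproduce the binary-detection figures, not the project's actual counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts that happen to match 0.941 precision / 0.894 recall.
p, r = precision_recall(tp=894, fp=56, fn=106)
print(round(p, 3), round(r, 3))  # → 0.941 0.894
```

The severity-aware numbers are lower partly because every misclassified severity turns a correct detection into both a false positive for one class and a false negative for the other, which is why the two-class problem is strictly harder to score well on.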