Multi-Head Fashion Attribute Recognition
DINOv2 + VLM-audited silver labels for per-image classification of color, pattern, material, and texture on a 13,355-image TextileNet subset.
Headline Result
Backbone comparison
Mean test macro-F1 0.765 (DINOv2 ViT-B/14), a uniform +11.6 pp over an identically-recipe-trained ResNet-50 baseline. DINOv2 wins on every head and on all 29 individual classes — smallest gap +3.4 pp (color/red), largest +19.6 pp (pattern/striped). Same data, splits, augmentation, optimizer, schedule, weights. Only the backbone changes.
| Head | n eval | DINOv2 F1 | ResNet-50 F1 | Δ |
|---|---|---|---|---|
| material | 3,760 | 0.744 | 0.612 | +13.2 |
| color | 3,743 | 0.827 | 0.738 | +8.9 |
| pattern | 3,527 | 0.737 | 0.616 | +12.1 |
| texture | 3,487 | 0.753 | 0.630 | +12.3 |
| mean | — | 0.765 | 0.649 | +11.6 |

The Problem
Public fashion datasets are fractured. DeepFashion2 covers detection, not attributes. Fashionpedia treats texture as embellishment. iMaterialist has color and pattern but no material. TextileNet has expert material labels but no color or pattern. No public corpus jointly labels color, pattern, material, and texture per image — and none evaluates whether the four axes are statistically independent.
What I Built
I built a four-head attribute-recognition pipeline on TextileNet, using deterministic material labels where the dataset was strong and VLM-audited silver labels where it was missing coverage.
Hybrid silver-label pipeline
Hybrid 4-attribute labeling pipeline on a 13,355-image TextileNet subset (21 fabric classes after schema-orphan and below-floor pruning).
Material (6 classes): rule-based mapping from fabric folder class. Perfect-precision by construction.
Color, pattern, texture (12 / 6 / 5 classes): Marqo-FashionSigLIP zero-shot, 5-template prompt ensembling, softmax with temperature 100, per-attribute confidence-threshold abstain (color 0.30; pattern, texture 0.40). Abstained labels masked from training loss via `ignore_index=-100`, preserving signal for the remaining heads.
210-image stratified human audit (10 per fabric class) anchors silver-label trust: macro-F1 0.971 (color), 0.803 (pattern), 0.681 (texture). The audit is the empirical ceiling on what any downstream model can achieve.
The Model
DINOv2 multi-head setup
DINOv2 ViT-B/14 backbone (86.6M params, self-supervised on LVD-142M) emitting a 768-d CLS embedding into four parallel linear heads (22.3k params total). Class-weighted cross-entropy summed across heads. AdamW with discriminative LRs (1e-5 backbone, 1e-4 heads), 2-epoch warmup + 18-epoch cosine over 20 epochs, AMP fp16, batch 64, single NVIDIA T4.
The Orthogonality Test
- ▸V_gt = 0.318 — moderate ground-truth coupling, dominated by physically-grounded co-occurrences (leather → smooth, knitwear → fluffy_furry).
- ▸V_pred = 0.353 — Δ = +0.035, just outside the ±0.03 faithfulness threshold. The model amplifies coupling slightly via mild prior collapse onto modal cells, but doesn't invent new dependencies.
- ▸Across just 20 test images, the predicted 4-tuples cover 17 distinct (material, texture) combinations — visual proof that the heads don't collapse to canonical pairings.
Stack
PyTorch · DINOv2 · Marqo-FashionSigLIP (open_clip) · scikit-learn · Hugging Face Hub · Jupyter · Google Colab (T4)