Multi-Head Fashion Attribute Recognition

DINOv2 + VLM-audited silver labels for per-image classification of color, pattern, material, and texture on a 13,355-image TextileNet subset.

PyTorch · DINOv2 · Marqo-FashionSigLIP · ViT

Try Live Demo · GitHub · IEEE writeup (PDF)

Headline Result

Backbone comparison

Mean test macro-F1 is 0.765 for DINOv2 ViT-B/14, an average of +11.6 pp over a ResNet-50 baseline trained with the identical recipe. DINOv2 wins on every head and on all 29 individual classes: the smallest gap is +3.4 pp (color/red), the largest +19.6 pp (pattern/striped). Data, splits, augmentation, optimizer, schedule, and class weights are the same. Only the backbone changes.

Head	n (eval)	DINOv2 F1	ResNet-50 F1	Δ (pp)
material	3,760	0.744	0.612	+13.2
color	3,743	0.827	0.738	+8.9
pattern	3,527	0.737	0.616	+12.1
texture	3,487	0.753	0.630	+12.3
mean	—	0.765	0.649	+11.6
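
A minimal sketch of how a per-head macro-F1 can be scored when some labels are abstained. It assumes abstained labels (marked `-100`) are dropped from evaluation, which would explain why the eval n differs by head; `masked_macro_f1` and the toy arrays are illustrative names and data, not the project's code.

```python
import numpy as np

IGNORE_INDEX = -100  # abstained silver labels, same marker the training loss uses


def masked_macro_f1(y_true, y_pred, n_classes):
    """Macro-F1 over rows whose label is not abstained. Returns (f1, n_eval)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    keep = y_true != IGNORE_INDEX
    y_true, y_pred = y_true[keep], y_pred[keep]

    f1_per_class = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        # In this sketch a class with no support and no predictions scores 0.
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1_per_class)), int(keep.sum())


# Toy example: the fourth image has an abstained label and is not scored.
f1, n_eval = masked_macro_f1(
    y_true=[0, 1, 1, IGNORE_INDEX, 2],
    y_pred=[0, 1, 2, 1, 2],
    n_classes=3,
)
print(f"macro-F1={f1:.3f} on n={n_eval}")
```

Averaging per-class F1 without support weighting is what makes the metric "macro": a rare class counts as much as a common one.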

Per-attribute confusion matrices, DINOv2 ViT-B/14, test split (n=3,760).

The Problem

Public fashion datasets are fragmented. DeepFashion2 covers detection, not attributes. Fashionpedia treats texture as embellishment. iMaterialist has color and pattern but no material. TextileNet has expert material labels but no color or pattern. No public corpus jointly labels color, pattern, material, and texture per image, and none evaluates whether the four axes are statistically independent.

What I Built

I built a four-head attribute-recognition pipeline on TextileNet, using deterministic material labels where the dataset already had reliable labels and VLM-audited silver labels for the attributes it does not cover.

Hybrid silver-label pipeline

▸ Hybrid 4-attribute labeling pipeline on a 13,355-image TextileNet subset (21 fabric classes after schema-orphan and below-floor pruning).

▸ Material (6 classes): rule-based mapping from fabric folder class. Perfect precision by construction.

▸ Color, pattern, texture (12 / 6 / 5 classes): Marqo-FashionSigLIP zero-shot with 5-template prompt ensembling and softmax with temperature 100. Each attribute abstains below its own confidence threshold (color 0.30; pattern and texture 0.40). Abstained labels are masked from the training loss via `ignore_index=-100`, so an image with one abstained attribute still contributes signal to the remaining heads.

▸ A 210-image stratified human audit (10 per fabric class) anchors silver-label trust: macro-F1 0.971 (color), 0.803 (pattern), 0.681 (texture). Because models are trained and scored on the silver labels, these audit scores are the empirical ceiling on what any downstream model can be trusted to achieve.
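
The zero-shot labeling step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: embeddings are taken as already computed (no model loading is shown), prompt ensembling is assumed to mean averaging the normalized template embeddings per class, and "temperature 100" is read as a CLIP-style logit scale that multiplies cosine similarities by 100. The function names and the toy embeddings are hypothetical.

```python
import numpy as np

IGNORE_INDEX = -100  # abstained labels, skipped later by the training loss


def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def ensemble_class_embeddings(text_emb):
    """(n_classes, n_templates, dim) -> (n_classes, dim), averaged over templates."""
    return l2_normalize(l2_normalize(text_emb).mean(axis=1))


def zero_shot_labels(image_emb, class_emb, threshold, logit_scale=100.0):
    """Return (labels, confidence); a label is IGNORE_INDEX when confidence < threshold."""
    logits = logit_scale * l2_normalize(image_emb) @ class_emb.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    confidence = probs.max(axis=1)
    labels = np.where(confidence >= threshold, probs.argmax(axis=1), IGNORE_INDEX)
    return labels, confidence


# Toy setup: 3 classes, 5 templates, 8-d embeddings (real embeddings come from the VLM).
dim, n_classes, n_templates = 8, 3, 5
basis = np.eye(dim)[:n_classes]
text_emb = np.repeat(basis[:, None, :], n_templates, axis=1)
class_emb = ensemble_class_embeddings(text_emb)

image_emb = np.stack([
    basis[1],           # clearly class 1
    basis.sum(axis=0),  # equally close to all three classes
])
labels, confidence = zero_shot_labels(image_emb, class_emb, threshold=0.40)
print(labels, confidence.round(3))
```

The first toy image is labeled class 1; the second splits its probability evenly across three classes, falls below the 0.40 threshold, and is marked `-100` so the loss ignores it.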

The Model

DINOv2 multi-head setup

DINOv2 ViT-B/14 backbone (86.6M params, self-supervised on LVD-142M) feeds a 768-d CLS embedding into four parallel linear heads (22.3k params total). The loss is class-weighted cross-entropy summed across heads. Training uses AdamW with discriminative LRs (1e-5 backbone, 1e-4 heads), a 2-epoch warmup followed by an 18-epoch cosine decay (20 epochs total), AMP fp16, and batch size 64 on a single NVIDIA T4.
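
A minimal PyTorch sketch of this setup, under stated assumptions: a tiny stand-in module replaces the DINOv2 backbone so the example runs without downloading weights, the warmup is assumed to be linear, and AMP and class weights are left out for brevity. `MultiHeadClassifier`, `summed_masked_ce`, and `lr_factor` are illustrative names, not the project's code.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

HEAD_SIZES = {"material": 6, "color": 12, "pattern": 6, "texture": 5}
IGNORE_INDEX = -100


class MultiHeadClassifier(nn.Module):
    """Shared backbone feeding one linear head per attribute."""

    def __init__(self, backbone, embed_dim=768, head_sizes=HEAD_SIZES):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleDict(
            {name: nn.Linear(embed_dim, n) for name, n in head_sizes.items()}
        )

    def forward(self, images):
        feats = self.backbone(images)  # (batch, embed_dim)
        return {name: head(feats) for name, head in self.heads.items()}


def summed_masked_ce(logits, targets, class_weights=None):
    """Sum per-head cross-entropy, ignoring abstained (-100) targets."""
    total = next(iter(logits.values())).new_zeros(())
    for name, head_logits in logits.items():
        target = targets[name]
        # Skip a head whose targets are all abstained; its mean loss would be NaN.
        if (target != IGNORE_INDEX).any():
            weight = None if class_weights is None else class_weights[name]
            total = total + F.cross_entropy(
                head_logits, target, weight=weight, ignore_index=IGNORE_INDEX
            )
    return total


def lr_factor(epoch, warmup_epochs=2, total_epochs=20):
    """LR multiplier per epoch: linear warmup (assumed), then cosine decay."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# Stand-in backbone (NOT DINOv2): anything mapping images to 768-d features works here.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 14 * 14, 768))
model = MultiHeadClassifier(backbone)
optimizer = torch.optim.AdamW(
    [
        {"params": model.backbone.parameters(), "lr": 1e-5},
        {"params": model.heads.parameters(), "lr": 1e-4},
    ]
)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

images = torch.randn(4, 3, 14, 14)
targets = {
    "material": torch.tensor([0, 1, 2, 3]),
    "color": torch.tensor([5, IGNORE_INDEX, 7, 0]),
    "pattern": torch.tensor([IGNORE_INDEX] * 4),  # fully abstained in this batch
    "texture": torch.tensor([1, 1, IGNORE_INDEX, 4]),
}
loss = summed_masked_ce(model(images), targets)
loss.backward()
optimizer.step()
scheduler.step()  # stepped once per epoch in a real loop

head_params = sum(p.numel() for p in model.heads.parameters())
print(head_params, float(loss))
```

The four heads total 768 × 29 weights plus 29 biases, which matches the 22.3k head parameters quoted above.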

The Orthogonality Test

▸ V_gt = 0.318 (Cramér's V on ground-truth labels): moderate coupling, dominated by physically grounded co-occurrences (leather → smooth, knitwear → fluffy_furry).
▸ V_pred = 0.353 (Cramér's V on predictions): Δ = +0.035, just outside the ±0.03 faithfulness threshold. The model amplifies coupling slightly via mild prior collapse onto modal cells, but doesn't invent new dependencies.
▸ Across just 20 test images, the predicted 4-tuples cover 17 distinct (material, texture) combinations, visual evidence that the heads don't collapse to canonical pairings.
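
For reference, Cramér's V for one pair of categorical attributes can be computed as in the sketch below. It uses the plain (uncorrected) form V = sqrt(χ² / (n · (min(r, c) − 1))); whether the project applies a bias correction, or how it aggregates V across attribute pairs, is not stated here, so treat those as open assumptions. The function name and toy labels are illustrative.

```python
import numpy as np


def cramers_v(a, b):
    """Uncorrected Cramér's V between two equal-length categorical label arrays."""
    _, a_idx = np.unique(np.asarray(a), return_inverse=True)
    _, b_idx = np.unique(np.asarray(b), return_inverse=True)
    n_rows, n_cols = int(a_idx.max()) + 1, int(b_idx.max()) + 1

    # Contingency table of co-occurrence counts.
    table = np.zeros((n_rows, n_cols))
    np.add.at(table, (a_idx, b_idx), 1)

    n = table.sum()
    expected = table.sum(axis=1, keepdims=True) * table.sum(axis=0, keepdims=True) / n
    chi2 = ((table - expected) ** 2 / expected).sum()

    k = min(n_rows, n_cols) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0


# Toy checks: fully coupled labels give 1, independent labels give 0.
v_coupled = cramers_v(["leather", "leather", "knit", "knit"],
                      ["smooth", "smooth", "fluffy", "fluffy"])
v_independent = cramers_v([0, 0, 1, 1], [0, 1, 0, 1])
print(v_coupled, v_independent)
```

Running the same function on ground-truth pairs and on predicted pairs gives the two quantities being compared, V_gt and V_pred.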

Stack

PyTorch · DINOv2 · Marqo-FashionSigLIP (open_clip) · scikit-learn · Hugging Face Hub · Jupyter · Google Colab (T4)

What This Project Demonstrates

Data engineering depth

TextileNet shipped with one attribute; the project ends with four, plus a measured noise floor per attribute, plus a full provenance manifest documenting every dataset decision.

Evaluation rigor

Reported test scores are measured against silver labels, with audit-derived trust numbers as the conservative ceiling. ResNet-50 is run under a byte-near-identical recipe, so the gain is cleanly attributable to the backbone rather than to anything else in the pipeline. Cramér's V on predictions is compared against Cramér's V on ground truth as a baseline, not reported as an absolute number alone.

Honest framing

The limitations section calls out VLM grounding ambiguity in multi-garment compositions, TextileNet inheritance noise, and ResNet-50 not having converged at 20 epochs. The headline number stands; the caveats are where the work earns trust.
