TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

30 Mar 2024 · Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, KyungSu Kim

We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a singular tag (word) while neglecting other pertinent tags, stemming from CLIP's text embeddings that prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to biased tag relevancy. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. TTD first extracts image-relevant tags from text based on their similarity to the nearest pixels, then employs a self-distillation strategy to align combined masks with the text-derived mask. This approach ensures unbiased image-text alignment of CLIP-based models using only image-text pairs, without necessitating additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code is available at
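The two steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`select_tags`, `combined_mask`, `mask_gap`), the threshold of 0.5, the pixel-wise maximum used to combine masks, and the mean absolute difference used to compare them are all assumptions made for illustration, and the embeddings are hand-written two-dimensional vectors rather than CLIP outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_tags(tag_embs, pixel_embs, threshold=0.5):
    """Step 1 (sketch): keep tags whose nearest pixel is similar enough.

    tag_embs:   dict mapping tag -> embedding (hypothetical input)
    pixel_embs: list of per-pixel embeddings (hypothetical input)
    threshold:  illustrative cutoff, not a value taken from the paper
    """
    # Score each tag by its best-matching pixel instead of the global
    # image embedding, so several tags can score highly at once.
    scores = {
        tag: max(cosine(emb, pixel) for pixel in pixel_embs)
        for tag, emb in tag_embs.items()
    }
    kept = [tag for tag, score in scores.items() if score >= threshold]
    return kept, scores


def combined_mask(tag_masks):
    """Pixel-wise maximum over per-tag masks (an assumed combination rule)."""
    return [max(values) for values in zip(*tag_masks)]


def mask_gap(tag_masks, text_mask):
    """Step 2 (sketch): mean absolute gap between combined and text masks.

    A self-distillation objective would push this gap toward zero; the
    exact loss used in the paper is not specified here.
    """
    merged = combined_mask(tag_masks)
    return sum(abs(m - t) for m, t in zip(merged, text_mask)) / len(text_mask)


# Toy data: two pixels, two relevant tags, one irrelevant tag.
pixels = [[1.0, 0.0], [0.0, 1.0]]
tags = {"dog": [0.9, 0.1], "grass": [0.1, 0.9], "tuesday": [-1.0, -1.0]}
kept, scores = select_tags(tags, pixels)

# A text mask that covers only the "dog" region mimics single tag bias.
gap = mask_gap([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [1.0, 0.0, 0.0, 0.0])
```

In this toy setup both "dog" and "grass" pass the cutoff because each has a nearby pixel, while the text mask covers only one of the two regions, so the gap is nonzero. That nonzero gap is the kind of discrepancy the alignment step is meant to reduce.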



Results from the Paper

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Unsupervised Semantic Segmentation with Language-image Pre-training | ADE20K | TTD (MaskCLIP) | Mean IoU (val) | 12.7 | # 5 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | ADE20K | TTD (TCL) | Mean IoU (val) | 17.0 | # 3 |
| Open Vocabulary Semantic Segmentation | ADE20K-150 | TTD (TCL) | mIoU | 17.0 | # 15 |
| Open Vocabulary Semantic Segmentation | ADE20K-150 | TTD (MaskCLIP) | mIoU | 12.7 | # 16 |
| Semantic Segmentation | CC3M-TagMask | TTD (TCL) | mIoU | 65.5 | # 1 |
| Semantic Segmentation | CC3M-TagMask | TTD (MaskCLIP) | mIoU | 50.2 | # 3 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | F1 | 78.5 | # 2 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | Precision | 82.9 | # 2 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | Recall | 74.5 | # 3 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | mAP | 90.3 | # 2 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | Accuracy | 91.0 | # 1 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | F1 | 82.8 | # 1 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | Precision | 88.3 | # 1 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | Recall | 78.0 | # 2 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | mAP | 93.7 | # 1 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | Accuracy | 88.6 | # 2 |
| Open Vocabulary Semantic Segmentation | Cityscapes | TTD (MaskCLIP) | mIoU | 27.0 | # 5 |
| Open Vocabulary Semantic Segmentation | Cityscapes | TTD (TCL) | mIoU | 32.0 | # 3 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | Cityscapes val | TTD (MaskCLIP) | mIoU | 32.0 | # 1 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | Cityscapes val | TTD (TCL) | mIoU | 27.0 | # 3 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Object | TTD (MaskCLIP) | mIoU | 26.5 | # 6 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Object | TTD (TCL) | mIoU | 37.4 | # 1 |
| Open Vocabulary Semantic Segmentation | COCO-Stuff-171 | TTD (TCL) | mIoU | 23.7 | # 1 |
| Open Vocabulary Semantic Segmentation | COCO-Stuff-171 | TTD (MaskCLIP) | mIoU | 19.4 | # 3 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Stuff-171 | TTD (MaskCLIP) | mIoU | 19.4 | # 4 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Stuff-171 | TTD (TCL) | mIoU | 23.7 | # 2 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL Context-59 | TTD (MaskCLIP) | mIoU | 31.0 | # 4 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL Context-59 | TTD (TCL) | mIoU | 37.4 | # 2 |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | TTD (TCL) | mIoU | 37.4 | # 15 |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | TTD (MaskCLIP) | mIoU | 31.0 | # 17 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL VOC | TTD (MaskCLIP) | mIoU | 43.1 | # 5 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL VOC | TTD (TCL) | mIoU | 61.1 | # 2 |


ALIGN • Focus