TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias

30 Mar 2024 · Sanghyun Jo, Soohyun Ryu, Sungyub Kim, Eunho Yang, KyungSu Kim

We identify a critical bias in contemporary CLIP-based models, which we denote as single tag bias. This bias manifests as a disproportionate focus on a single tag (word) while neglecting other pertinent tags, stemming from CLIP's text embeddings, which prioritize one specific tag in image-text relationships. When deconstructing text into individual tags, only one tag tends to have high relevancy with CLIP's image embedding, leading to biased tag relevancy. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. TTD first extracts image-relevant tags from text based on their similarity to the nearest pixels, and then employs a self-distillation strategy to align the combined tag masks with the text-derived mask. This approach ensures unbiased image-text alignment of CLIP-based models using only image-text pairs, without requiring additional supervision. Our technique demonstrates model-agnostic improvements in multi-tag classification and segmentation tasks, surpassing competing methods that rely on external resources. The code is available at https://github.com/shjo-april/TTD.
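The two-step procedure described in the abstract can be illustrated with a minimal sketch, assuming per-patch CLIP image embeddings and per-tag CLIP text embeddings are already computed. The function names, the threshold `tau`, and the simple L2 mask-alignment loss below are illustrative assumptions, not the authors' implementation; the linked repository contains the actual training code.

```python
# Minimal sketch of the two-step TTD idea, NOT the authors' implementation.
# Assumed inputs: per-patch image embeddings, per-tag text embeddings, and a
# whole-caption text embedding, all from a CLIP-style model.
import torch
import torch.nn.functional as F


def select_image_relevant_tags(patch_embeds, tag_embeds, tau=0.5):
    """Step 1: score each tag by its similarity to its nearest image patch.

    patch_embeds: (P, D) patch/pixel embeddings for one image
    tag_embeds:   (T, D) text embeddings, one per tag parsed from the caption
    Returns a boolean mask (T,) of tags whose nearest-patch similarity exceeds tau.
    """
    patches = F.normalize(patch_embeds, dim=-1)       # (P, D)
    tags = F.normalize(tag_embeds, dim=-1)            # (T, D)
    sim = tags @ patches.t()                          # (T, P) cosine similarities
    nearest_patch_sim = sim.max(dim=1).values         # (T,) similarity to nearest pixel
    return nearest_patch_sim > tau


def mask_alignment_loss(patch_embeds, tag_embeds, text_embed, keep):
    """Step 2: align the union of per-tag masks with the text-derived mask.

    text_embed: (D,) embedding of the whole caption
    keep:       (T,) boolean mask from step 1
    The exact teacher/student roles and loss used by the authors may differ;
    a plain L2 alignment is used here purely for illustration.
    """
    patches = F.normalize(patch_embeds, dim=-1)
    kept_tags = F.normalize(tag_embeds[keep], dim=-1)
    tag_masks = torch.sigmoid(patches @ kept_tags.t())                    # (P, T')
    combined_mask = tag_masks.max(dim=1).values                           # (P,) union over kept tags
    text_mask = torch.sigmoid(patches @ F.normalize(text_embed, dim=-1))  # (P,)
    return F.mse_loss(combined_mask, text_mask)


if __name__ == "__main__":
    # Toy shapes: e.g. a 14x14 ViT patch grid and 8 candidate tags.
    P, T, D = 196, 8, 512
    patch_embeds, tag_embeds, text_embed = torch.randn(P, D), torch.randn(T, D), torch.randn(D)
    keep = select_image_relevant_tags(patch_embeds, tag_embeds, tau=0.0)
    loss = mask_alignment_loss(patch_embeds, tag_embeds, text_embed, keep)
    print(keep.sum().item(), "tags kept, loss =", loss.item())
```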


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|---|---|---|---|---|---|
| Unsupervised Semantic Segmentation with Language-image Pre-training | ADE20K | TTD (MaskCLIP) | Mean IoU (val) | 12.7 | # 5 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | ADE20K | TTD (TCL) | Mean IoU (val) | 17.0 | # 3 |
| Open Vocabulary Semantic Segmentation | ADE20K-150 | TTD (TCL) | mIoU | 17.0 | # 15 |
| Open Vocabulary Semantic Segmentation | ADE20K-150 | TTD (MaskCLIP) | mIoU | 12.7 | # 16 |
| Semantic Segmentation | CC3M-TagMask | TTD (TCL) | mIoU | 65.5 | # 1 |
| Semantic Segmentation | CC3M-TagMask | TTD (MaskCLIP) | mIoU | 50.2 | # 3 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | F1 | 78.5 | # 2 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | Precision | 82.9 | # 2 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | Recall | 74.5 | # 3 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | mAP | 90.3 | # 2 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/o fine-tuning) | Accuracy | 91.0 | # 1 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | F1 | 82.8 | # 1 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | Precision | 88.3 | # 1 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | Recall | 78.0 | # 2 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | mAP | 93.7 | # 1 |
| Multi-Label Text Classification | CC3M-TagMask | TTD (w/ fine-tuning) | Accuracy | 88.6 | # 2 |
| Open Vocabulary Semantic Segmentation | Cityscapes | TTD (MaskCLIP) | mIoU | 27.0 | # 5 |
| Open Vocabulary Semantic Segmentation | Cityscapes | TTD (TCL) | mIoU | 32.0 | # 3 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | Cityscapes val | TTD (MaskCLIP) | mIoU | 32.0 | # 1 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | Cityscapes val | TTD (TCL) | mIoU | 27.0 | # 3 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Object | TTD (MaskCLIP) | mIoU | 26.5 | # 6 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Object | TTD (TCL) | mIoU | 37.4 | # 1 |
| Open Vocabulary Semantic Segmentation | COCO-Stuff-171 | TTD (TCL) | mIoU | 23.7 | # 1 |
| Open Vocabulary Semantic Segmentation | COCO-Stuff-171 | TTD (MaskCLIP) | mIoU | 19.4 | # 3 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Stuff-171 | TTD (MaskCLIP) | mIoU | 19.4 | # 4 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | COCO-Stuff-171 | TTD (TCL) | mIoU | 23.7 | # 2 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL Context-59 | TTD (MaskCLIP) | mIoU | 31.0 | # 4 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL Context-59 | TTD (TCL) | mIoU | 37.4 | # 2 |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | TTD (TCL) | mIoU | 37.4 | # 15 |
| Open Vocabulary Semantic Segmentation | PASCAL Context-59 | TTD (MaskCLIP) | mIoU | 31.0 | # 17 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL VOC | TTD (MaskCLIP) | mIoU | 43.1 | # 5 |
| Unsupervised Semantic Segmentation with Language-image Pre-training | PASCAL VOC | TTD (TCL) | mIoU | 61.1 | # 2 |

Methods


ALIGN • Focus