Image-to-Text Retrieval
29 papers with code • 8 benchmarks • 8 datasets
Image-to-text retrieval is the task of retrieving the textual descriptions most relevant to a given query image; the reverse direction, finding images that match a textual query, is the closely related text-to-image retrieval task. It is an interdisciplinary area that blends techniques from computer vision, natural language processing (NLP), and machine learning, with the aim of bridging the semantic gap between the visual information in images and the textual descriptions humans use to interpret them.
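A common baseline is a dual-encoder model that embeds images and texts into a shared space and ranks candidates by similarity. The sketch below uses the Hugging Face transformers implementation of CLIP purely for illustration (CLIP itself is not one of the papers listed here); the checkpoint name, image path, and captions are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative dual-encoder retrieval; checkpoint name is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("query.jpg").convert("RGB")  # hypothetical query image
captions = [
    "a dog running on the beach",
    "a city skyline at night",
    "two people playing chess",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: similarity of the image to each caption (higher = better match)
scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
best = scores.argmax().item()
print(f"Best caption: {captions[best]!r} (p={scores[best]:.3f})")
```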
Most implemented papers
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
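For retrieval specifically, the LAVIS library (which hosts the official BLIP-2 code) provides a feature-extractor variant whose projected image and text embeddings can be compared directly. The snippet below is a sketch based on LAVIS's documented feature-extraction interface; the file path and captions are placeholders, and the exact model and type names should be verified against the library.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Feature-extractor variant of BLIP-2 (frozen image encoder + Q-Former).
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

image = vis_processors["eval"](Image.open("query.jpg").convert("RGB")).unsqueeze(0).to(device)
captions = ["a dog running on the beach", "a city skyline at night"]
sample = {"image": image, "text_input": [txt_processors["eval"](c) for c in captions]}

img_feat = model.extract_features(sample, mode="image").image_embeds_proj  # (1, 32, dim)
txt_feat = model.extract_features(sample, mode="text").text_embeds_proj    # (n, seq, dim)

# Image-text contrastive score: max over the query tokens of the dot
# product with each text's [CLS] embedding.
sim = (img_feat @ txt_feat[:, 0, :].t()).max(dim=1).values.squeeze(0)
print(captions[sim.argmax().item()])
```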
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks.
Deep Visual-Semantic Alignments for Generating Image Descriptions
Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data.
FLAVA: A Foundational Language And Vision Alignment Model
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages
Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.
Exploring Models and Data for Remote Sensing Image Caption Generation
Finally, a comprehensive review of the proposed dataset is presented to advance the task of remote sensing image captioning.
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model.
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources.
A Differentiable Semantic Metric Approximation in Probabilistic Embedding for Cross-Modal Retrieval
To verify the effectiveness of our approach, extensive experiments are conducted on MS-COCO, CUB Captions, and Flickr30K, which are commonly used in cross-modal retrieval.
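On benchmarks such as MS-COCO and Flickr30K, image-to-text retrieval is typically reported as Recall@K: the fraction of query images whose ground-truth caption appears among the top K ranked texts. Below is a minimal sketch, assuming precomputed L2-normalized embedding matrices and one ground-truth caption per image (MS-COCO actually provides five captions per image, which the metric would need to account for).

```python
import numpy as np

def recall_at_k(image_embs: np.ndarray, text_embs: np.ndarray, k: int = 5) -> float:
    """Recall@K for image-to-text retrieval.

    Assumes image_embs[i] and text_embs[i] form a matched pair and both
    matrices are L2-normalized, so dot product equals cosine similarity.
    """
    sims = image_embs @ text_embs.T                      # (n_images, n_texts)
    # Indices of the top-k most similar texts for each image query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(image_embs))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy usage with random unit vectors (placeholder data).
rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 256))
embs /= np.linalg.norm(embs, axis=1, keepdims=True)
print(f"R@5 (identical embeddings): {recall_at_k(embs, embs, k=5):.2f}")
```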