76 dataset results for Image Retrieval

The dataset consists of over 350,000 public-domain patent drawings collected from the United States Patent and Trademark Office (USPTO). The drawings come from a total of 45,000 design patents published between January 2018 and June 2019.

3 PAPERS • 1 BENCHMARK

GPR1200

Most publications that aim to optimize neural networks for content-based image retrieval (CBIR) train and test their models on domain-specific datasets. It is therefore unclear whether those networks can be used as general-purpose image feature extractors. After analyzing popular image retrieval test sets, we decided to manually curate GPR1200, an easy-to-use and accessible but challenging benchmark dataset with 1200 categories and 10 examples per class. Classes and images were manually selected from six publicly available datasets covering different image domains, ensuring high class diversity and clean class boundaries.

3 PAPERS • NO BENCHMARKS YET

MMID (Massively Multilingual Image Dataset)

A large-scale multilingual corpus of images, each labeled with the word it represents. The dataset includes approximately 10,000 words in each of 100 languages.

3 PAPERS • NO BENCHMARKS YET

CREPE (Compositional REPresentation Evaluation)

A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that—across 7 architectures trained with 4 algorithms on massive datasets—they struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs with nine different complexities plus 183K hard neg

2 PAPERS • 1 BENCHMARK

Cross-View Time Dataset

The appearance of the world varies dramatically not only from place to place but also from hour to hour and month to month. Every day billions of images capture this complex relationship, many of which are associated with precise time and location metadata. We propose to use these images to construct a global-scale, dynamic map of visual appearance attributes. Such a map enables fine-grained understanding of the expected appearance at any geographic location and time. Our approach integrates dense overhead imagery with location and time metadata into a general framework capable of mapping a wide variety of visual attributes. A key feature of our approach is that it requires no manual data annotation. We demonstrate how this approach can support various applications, including image-driven mapping, image geolocalization, and metadata verification.

2 PAPERS • 1 BENCHMARK

DyML-Animal (Dynamic Metric Learning Animal)

DyML-Animal is based on animal images selected from ImageNet-5K [1]. It has 5 semantic scales (i.e., class, order, family, genus, species) according to biological taxonomy. Specifically, there are 611 “species” for the fine level, 47 categories corresponding to “order”, “family” or “genus” for the middle level, and 5 “classes” for the coarse level. We note that for some animals visual perception contradicts biological taxonomy, e.g., a whale is a “mammal” but actually looks more similar to a fish. Annotating the whale images as mammals would confuse visual recognition, so we carefully check for potential contradictions and intentionally leave out those animals.

2 PAPERS • 1 BENCHMARK

DyML-Product (Dynamic Metric Learning Product)

DyML-Product is derived from iMaterialist-2019, a hierarchical online product dataset. The original iMaterialist-2019 offers up to 4 levels of hierarchical annotations. We remove the coarsest level and maintain 3 levels for DyML-Product.

2 PAPERS • 1 BENCHMARK

DyML-Vehicle (Dynamic Metric Learning Vehicle)

DyML-Vehicle merges two vehicle re-ID datasets, PKU VehicleID [1] and VERI-Wild [1]. Since these two datasets have annotations only on the identity (fine) level, we manually annotate each image with a “model” label (e.g., Toyota Camry, Honda Accord, Audi A4) and a “body type” label (e.g., car, SUV, microbus, pickup). Moreover, we label all the taxi images as a novel testing class under the coarse level.

2 PAPERS • 1 BENCHMARK

European Flood 2013 Dataset

This dataset consists of 3,710 flood images, annotated by domain experts regarding their relevance with respect to three tasks (determining the flooded area, inundation depth, water pollution).

2 PAPERS • NO BENCHMARKS YET

INSTRE

INSTRE is a benchmark for INSTance-level visual object REtrieval and REcognition (INSTRE). INSTRE has the following major properties: (1) balanced data scale, (2) more diverse intraclass instance variations, (3) cluttered and less contextual backgrounds, (4) object localization annotation for each image, (5) well-manipulated double-labelled images for measuring the multiple-object (within one image) case.

2 PAPERS • 1 BENCHMARK

LaSCo

Large Scale Composed Image Retrieval (LaSCo) is a new dataset for Composed Image Retrieval (CoIR), 10 times larger than current ones.

2 PAPERS • 1 BENCHMARK

Salient Object Subitizing Dataset

A salient object subitizing image dataset of about 14K everyday images which are annotated using an online crowdsourcing marketplace.

2 PAPERS • NO BENCHMARKS YET

CBVS

A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenario

1 PAPER • 1 BENCHMARK

ConQA (Conceptual Query Answering)

ConQA is a dataset created using the intersection between VisualGenome and MS-COCO. The goal of this dataset is to provide a new benchmark for text-to-image retrieval using queries that are shorter and less descriptive than the commonly used captions from MS-COCO or Flickr. ConQA consists of 80 queries divided into 50 conceptual and 30 descriptive queries. A descriptive query mentions some of the objects in the image, for instance, people chopping vegetables. A conceptual query, in contrast, does not mention objects or only refers to objects in a general context, e.g., working class life.

1 PAPER • 2 BENCHMARKS

Cross-View Time Dataset (Cross-Camera Split)

The standard evaluation protocol of the Cross-View Time dataset allows certain cameras to be shared between training and testing sets. This protocol can emulate scenarios in which we need to verify the authenticity of images from a particular set of devices and locations. Considering the ubiquity of surveillance systems (CCTV) nowadays, this is a common scenario, especially for big cities and high-visibility events (e.g., protests, musical concerts, terrorist attempts, sports events). In such cases, we can leverage the availability of historical photographs from that device and collect additional images from previous days, months, and years. This would allow the model to better capture the particularities of how time influences the appearance of that specific place, probably leading to better verification accuracy. However, there might be cases in which data originates from heterogeneous sources, such as social media. In this sense, it is essential that models are optimized on camer

1 PAPER • 1 BENCHMARK

Dataset of Structured Queries and Spatial Relations

Provides 450,000 relevance annotations and 53 structured queries.

1 PAPER • NO BENCHMARKS YET

DialogCC

DialogCC is a large-scale multi-modal dialogue dataset, which covers diverse real-world topics and various images per dialogue. It contains 651k unique images and is designed for image and text retrieval tasks.

1 PAPER • NO BENCHMARKS YET

FETA Car-Manuals (FETA Car-Manuals dataset, image-text retrieval for foundation models' expert data performance.)

The FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. The FETA Car-Manuals dataset consists of a total of 349 PDF documents from 5 car manufacturers, namely Nissan, Toyota, Mazda, Renault, and Chevrolet.

1 PAPER • 2 BENCHMARKS

FooDI-ML (Food Drinks and groceries Images Multi Lingual)

Food Drinks and groceries Images Multi Lingual (FooDI-ML) is a dataset that contains over 1.5M unique images and over 9.5M store names, product names, descriptions, and collection sections gathered from the Glovo application. The data made available corresponds to food, drinks and groceries products from 37 countries in Europe, the Middle East, Africa and Latin America. The dataset covers 33 languages, including 870K samples in languages of countries from Eastern Europe and Western Asia, such as Ukrainian and Kazakh, which have so far been underrepresented in publicly available visiolinguistic datasets. The dataset also includes widely spoken languages such as Spanish and English.

1 PAPER • 2 BENCHMARKS

IAPR TC-12 (IAPR TC-12 Benchmark)

The image collection of the IAPR TC-12 Benchmark consists of 20,000 still natural images taken at locations around the world, forming an assorted cross-section of such images. This includes pictures of different sports and actions, photographs of people, animals, cities, landscapes, and many other aspects of contemporary life. Each image is associated with a text caption in up to three different languages (English, German and Spanish).

1 PAPER • NO BENCHMARKS YET

IAW Dataset (Ikea Assembly In The Wild Dataset)

The IAW dataset contains 420 Ikea furniture pieces from 14 common categories, e.g., sofa, bed, wardrobe, and table. Each piece of furniture comes with one or more user instruction manuals, which are first divided into pages and then further divided into independent steps cropped from each page (some pages contain more than one step and some pages do not contain instructions). There are 8568 pages and 8263 steps overall, on average 20.4 pages and 19.7 steps for each piece of furniture. We crawled YouTube to find videos corresponding to these instruction manuals, and as such the conditions in the videos are diverse in many aspects, e.g., duration, resolution, first- or third-person view, camera pose, background environment, and number of assemblers. The IAW dataset contains 1005 raw videos with a length of around 183 hours in total. Among them, approximately 114 hours of content are labeled as 15649 actions to match the corresponding step in the corresponding manual.
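The per-piece averages quoted in this description follow directly from the totals. A minimal Python check, using only the figures given above (the variable names are illustrative and not part of the dataset):

```python
# Totals quoted in the IAW description.
num_furniture = 420
num_pages = 8568
num_steps = 8263

# Average manual pages and steps per furniture piece, rounded to one decimal.
avg_pages = round(num_pages / num_furniture, 1)
avg_steps = round(num_steps / num_furniture, 1)

print(avg_pages, avg_steps)
```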

1 PAPER • NO BENCHMARKS YET

InstaCities1M

InstaCities1M is a dataset of social media images with associated text. It consists of Instagram images associated with one of the 10 most populated English-speaking cities in the world. It has 100K images for each city, which makes a total of 1M images, split into 800K training images, 50K validation images and 150K testing images. All images were resized to 300x300 pixels.
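As a quick consistency check, the per-city count and the split sizes stated above can be reconciled in a few lines of Python. This is only a sketch of the arithmetic; the split names and variable names are assumptions, not identifiers from the dataset:

```python
# Figures quoted in the InstaCities1M description.
images_per_city = 100_000
num_cities = 10
splits = {"train": 800_000, "val": 50_000, "test": 150_000}  # names are illustrative

# Total image count implied by the per-city figure.
total = images_per_city * num_cities

# The three splits should account for every image.
splits_cover_total = sum(splits.values()) == total

# Share of the dataset assigned to each split.
fractions = {name: count / total for name, count in splits.items()}
print(total, splits_cover_total, fractions)
```

Under these figures the split is 80% training, 5% validation and 15% testing.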

1 PAPER • NO BENCHMARKS YET

Large Labelled Logo Dataset (L3D)

It is composed of around 770k color 256x256 RGB images extracted from the European Union Intellectual Property Office (EUIPO) open registry. Each of them is associated with multiple labels that classify the figurative and textual elements that appear in the images. These annotations have been assigned by the EUIPO evaluators using the Vienna classification, a hierarchical classification of figurative marks.

1 PAPER • 1 BENCHMARK

MELON (Melodic Design)

A unique dataset comprising multimodal creative and designed documents containing images with corresponding captions, paired with music based on around 50 moods/themes.

1 PAPER • NO BENCHMARKS YET

NAVER LABS Localization Datasets

The NAVER LABS localization datasets are 5 new indoor datasets for visual localization in challenging real-world environments. They were captured in a large shopping mall and a large metro station in Seoul, South Korea, using a dedicated mapping platform consisting of 10 cameras and 2 laser scanners. In order to obtain accurate ground truth camera poses, we used a robust LiDAR SLAM which provides initial poses that are then refined using a novel structure-from-motion based optimization. The datasets are provided in the kapture format and contain about 130k images as well as 6DoF camera poses for training and validation. We also provide sparse LiDAR-based depth maps for the training images. The poses of the test set are withheld so as not to bias the benchmark.

1 PAPER • NO BENCHMARKS YET

PKU SketchRe-ID Dataset

The PKU Sketch Re-ID dataset is constructed by the National Engineering Laboratory for Video Technology (NELVT), Peking University.

1 PAPER • 1 BENCHMARK

WebLI (Web Language Image)

WebLI (Web Language Image) is a web-scale multilingual image-text dataset, designed to support Google’s vision-language research, such as large-scale pre-training for image understanding, image captioning, visual question answering, and object detection.

1 PAPER • NO BENCHMARKS YET

fruit-SALAD

fruit-SALAD is a synthetic image dataset with 10,000 generated images of fruit depictions. This combined semantic category and style benchmark comprises 100 instances for each combination of 10 easily recognizable fruit categories and 10 easily distinguishable styles.
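The 10,000-image total is consistent with reading the description as a full category-by-style grid with 100 instances per cell. A minimal Python sketch of that layout, assuming this reading; the category and style names below are placeholders, since the description gives only the counts:

```python
from itertools import product

# Placeholder labels: the description states counts, not the actual names.
categories = [f"fruit_{i}" for i in range(10)]
styles = [f"style_{j}" for j in range(10)]
instances_per_pair = 100

# One entry per (category, style, instance index) combination.
grid = [
    (category, style, k)
    for category, style in product(categories, styles)
    for k in range(instances_per_pair)
]

print(len(grid))  # 10 categories * 10 styles * 100 instances
```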

1 PAPER • NO BENCHMARKS YET