The dataset consists of over 350,000 public domain patent drawings collected from the United States Patent and Trademark Office (USPTO). The whole collection consists of a total of 45,000 design patents published between January 2018 and June 2019.
3 PAPERS • 1 BENCHMARK
Most publications that aim to optimize neural networks for CBIR, train and test their models on domain specific datasets. It is therefore unclear, if those networks can be used as a general-purpose image feature extractor. After analyzing popular image retrieval test sets we decided to manually curate GPR1200, an easy to use and accessible but challenging benchmark dataset with 1200 categories and 10 class examples. Classes and images were manually selected from six publicly available datasets of different image areas, ensuring high class diversity and clean class boundaries.
3 PAPERS • NO BENCHMARKS YET
A large-scale multilingual corpus of images, each labeled with the word it represents. The dataset includes approximately 10,000 words in each of 100 languages.
A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that—across 7 architectures trained with 4 algorithms on massive datasets—they struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs with nine different complexities plus 183K hard neg
2 PAPERS • 1 BENCHMARK
The appearance of the world varies dramatically not only from place to place but also from hour to hour and month to month. Every day billions of images capture this complex relationship, many of which are associated with precise time and location metadata. We propose to use these images to construct a global-scale, dynamic map of visual appearance attributes. Such a map enables fine-grained understanding of the expected appearance at any geographic location and time. Our approach integrates dense overhead imagery with location and time metadata into a general framework capable of mapping a wide variety of visual attributes. A key feature of our approach is that it requires no manual data annotation. We demonstrate how this approach can support various applications, including image-driven mapping, image geolocalization, and metadata verification.
DyML-Animal is based on animal images selected from ImageNet-5K [1]. It has 5 semantic scales (i.e., classes, order, family, genus, species) according to biological taxonomy. Specifically, there are 611 “species” for the fine level, 47 categories corresponding to “order”, “family” or “genus” for the middle level, and 5 “classes” for the coarse level. We note some animals have contradiction between visual perception and biological taxonomy, e.g., whale in “mammal” actually looks more similar to fish. Annotating the whale images as belonging to mammal would cause confusion to visual recognition. So we take a detailed check on potential contradictions and intentionally leave out those animals.
DyML-Product is derived from iMaterialist-2019, a hierarchical online product dataset. The original iMaterialist-2019 offers up to 4 levels of hierarchical annotations. We remove the coarsest level and maintain 3 levels for DyML-Product.
DyML-Vehicle merges two vehicle re-ID datasets PKU VehicleID [1], VERI-Wild [1]. Since these two datasets have only annotations on the identity (fine) level, we manually annotate each image with “model” label (e.g., Toyota Camry, Honda Accord, Audi A4) and “body type” label (e.g., car, suv, microbus, pickup). Moreover, we label all the taxi images as a novel testing class under coarse level.
This dataset consists of 3,710 flood images, annotated by domain experts regarding their relevance with respect to three tasks (determining the flooded area, inundation depth, water pollution).
2 PAPERS • NO BENCHMARKS YET
INSTRE is a benchmark for INSTance-level visual object REtrieval and REcognition (INSTRE). INSTRE has the following major properties: (1) balanced data scale, (2) more diverse intraclass instance variations, (3) cluttered and less contextual backgrounds, (4) object localization annotation for each image, (5) well-manipulated double-labelled images for measuring multiple object (within one image) case.
Large Scale Composed Image Retrieval (LaSCo) is a new dataset for Composed Image Retrieval (CoIR), x10 times larger than current ones.
A salient object subitizing image dataset of about 14K everyday images which are annotated using an online crowdsourcing marketplace.
A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenario
1 PAPER • 1 BENCHMARK
ConQA is a dataset created using the intersection between VisualGenome and MS-COCO. The goal of this dataset is to provide a new benchmark for text to image retrieval using short and less descriptive queries than the commonly use captions from MS-COCO or Flicker. ConQA consists of 80 queries divided into 50 conceptual and 30 descriptive queries. A descriptive query mentions some of the objects in the image, for instance, people chopping vegetables. While, a conceptual query does not mention objects or only refers to objects in a general context, e.g., working class life.
1 PAPER • 2 BENCHMARKS
The standard evaluation protocol of Cross-View Time dataset allows for certain cameras to be shared between training and testing sets. This protocol can emulate scenarios in which we need to verify the authenticity of images from a particular set of devices and locations. Considering the ubiquity of surveillance systems (CCTV) nowadays, this is a common scenario, especially for big cities and high visibility events (e.g., protests, musical concerts, terrorist attempts, sports events). In such cases, we can leverage the availability of historical photographs of that device and collect additional images from previous days, months, and years. This would allow the model to better capture the particularities of how time influences the appearance of that specific place, probably leading to a better verification accuracy. However, there might be cases in which data is originated from heterogeneous sources, such as social media. In this sense, it is essential that models are optimized on camer
Provides 450, 000 relevance annotations and 53 structured queries.
1 PAPER • NO BENCHMARKS YET
DialogCC is a large-scale multi-modal dialogue dataset, which covers diverse real-world topics and various images per dialogue. It contains 651k unique images and is designed for image and text retrieval tasks.
FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. The FETA Car-Manuals dataset consists of a total of 349 PDF documents from 5 car manufacturers, namely Nissan, Toyota, Mazda, Renault, Chevrolet.
Food Drinks and groceries Images Multi Lingual (FooDI-ML) is a dataset that contains over 1.5M unique images and over 9.5M store names, product names descriptions, and collection sections gathered from the Glovo application. The data made available corresponds to food, drinks and groceries products from 37 countries in Europe, the Middle East, Africa and Latin America. The dataset comprehends 33 languages, including 870K samples of languages of countries from Eastern Europe and Western Asia such as Ukrainian and Kazakh, which have been so far underrepresented in publicly available visiolinguistic datasets. The dataset also includes widely spoken languages such as Spanish and English.
The image collection of the IAPR TC-12 Benchmark consists of 20,000 still natural images taken from locations around the world and comprising an assorted cross-section of still natural images. This includes pictures of different sports and actions, photographs of people, animals, cities, landscapes, and many other aspects of contemporary life. Each image is associated with a text caption in up to three different languages (English, German and Spanish).
The IAW dataset contains 420 Ikea furniture pieces from 14 common categories e.g. sofa, bed, wardrobe, table, etc. Each piece of furniture comes with one or more user instruction manuals, which are first divided into pages and then further divided into independent steps cropped from each page (some pages contain more than one step and some pages do not contain instructions). There are 8568 pages and 8263 steps overall, on average 20.4 pages and 19.7 steps for each piece of furniture. We crawled YouTube to find videos corresponding to these instruction manuals and as such the conditions in the videos are diverse on many aspects e.g. duration, resolution, first- or third-person view, camera pose, background environment, number of assemblers, etc. The IAW dataset contains 1005 raw videos with a length of around 183 hours in total. Among them, approximately 114 hours of content are labeled as 15649 actions to match the corresponding step in the corresponding manual.
InstaCities1M is a dataset of social media images with associated text. It consists of Instagram images associated associated with one of the 10 most populated English speaking cities all over the world. It has 100K images for each city, which makes a total of 1M images, split in 800K training images, 50K validation images and 150K testing images. All images were resized to 300x300 pixels.
It is composed of around 770k of color 256x256 RGB images extracted from the European Union Intellectual Property Office (EUIPO) open registry. Each of them is associated to multiple labels that classify the figurative and textual elements that appear in the images. These annotations have been classified by the EUIPO evaluators using the Vienna classification, a hierarchical classification of figurative marks.
A unique dataset comprising multimodal creative and designed documents containing images with corresponding captions paired with music based on around 50mood/themes.
The NAVER LABS localization datasets are 5 new indoor datasets for visual localization in challenging real-world environments. They were captured in a large shopping mall and a large metro station in Seoul, South Korea, using a dedicated mapping platform consisting of 10 cameras and 2 laser scanners. In order to obtain accurate ground truth camera poses, we used a robust LiDAR SLAM which provides initial poses that are then refined using a novel structure-from-motion based optimization. The datasets are provided in the kapture format and contain about 130k images as well as 6DoF camera poses for training and validation. We also provide sparse Lidar-based depth maps for the training images. The poses of the test set are withheld to not bias the benchmark.
The PKU Sketch Re-ID dataset is constructed by National Engineering Laboratory for Video Technology (NELVT), Peking University.
WebLI (Web Language Image) is a web-scale multilingual image-text dataset, designed to support Google’s vision-language research, such as the large-scale pre-training for image understanding, image captioning, visual question answering, object detection etc.
fruit-SALAD is a synthetic image dataset with 10,000 generated images of fruit depictions. This combined semantic category and style benchmark comprises 100 instances each of 10 easily recognizable fruit categories and 10 easy distinguishable styles.