The Georgia Tech Egocentric Activities (GTEA) dataset contains seven types of daily activities, such as making a sandwich, tea, or coffee. Each activity is performed by four different people, for a total of 28 videos. Each video is approximately one minute long and contains about 20 fine-grained action instances such as "take bread" or "pour ketchup".
106 PAPERS • 2 BENCHMARKS
The Holistic Evaluation of Language Models (HELM) is a comprehensive framework developed by Stanford University for evaluating foundation language models. It serves as a living benchmark, promoting transparency in language models. Here are the key aspects of HELM:
106 PAPERS • NO BENCHMARKS YET
KILT (Knowledge Intensive Language Tasks) is a benchmark consisting of 11 datasets representing 5 types of tasks:
106 PAPERS • 12 BENCHMARKS
WikiHow is a dataset of more than 230,000 article and summary pairs extracted and constructed from an online knowledge base written by different human authors. The articles span a wide range of topics and represent a high diversity of styles.
Dataset composed of online banking queries annotated with their corresponding intents.
105 PAPERS • 5 BENCHMARKS
The CSIQ database consists of 30 original images, each distorted using six different types of distortions at four to five levels of distortion. CSIQ images are subjectively rated based on a linear displacement of the images across four calibrated LCD monitors placed side by side with equal viewing distance to the observer. The database contains 5000 subjective ratings from 35 different observers, and ratings are reported in the form of DMOS.
105 PAPERS • 1 BENCHMARK
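DMOS (Differential Mean Opinion Score) values like those reported for CSIQ are typically derived from per-subject difference scores that are z-scored to remove rater bias before averaging. A minimal sketch of that generic pipeline, not CSIQ's exact monitor-displacement protocol; the function name and array shapes are illustrative assumptions:

```python
import numpy as np

def dmos_scores(ratings, ref_ratings):
    """Generic DMOS-style computation.

    ratings:     (subjects, images) quality ratings of distorted images
    ref_ratings: (subjects,) ratings of the pristine reference images

    Difference scores (reference minus distorted) are z-scored per
    subject to remove individual rater bias, then averaged across
    subjects. Higher DMOS means lower perceived quality.
    """
    d = ref_ratings[:, None] - ratings                         # difference scores
    z = (d - d.mean(axis=1, keepdims=True)) / d.std(axis=1, keepdims=True)
    return z.mean(axis=0)                                      # DMOS per image
```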
A labeled benchmark dataset for training machine learning models to statically detect malicious Windows portable executable files. The dataset includes features extracted from 1.1M binary files: 900K training samples (300K malicious, 300K benign, 300K unlabeled) and 200K test samples (100K malicious, 100K benign).
105 PAPERS • NO BENCHMARKS YET
Multi-News consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited.
105 PAPERS • 4 BENCHMARKS
NELL-995 is a knowledge graph completion dataset constructed from the NELL (Never-Ending Language Learning) system.
105 PAPERS • 2 BENCHMARKS
The Newsela dataset was introduced by Xu et al. in their research on text simplification. It is a corpus that includes thousands of news articles professionally leveled to different reading complexities. The dataset is used for academic research in fields such as text difficulty and text simplification. It is made available to academic partners upon request. The dataset is often used as a benchmark in the field of text simplification. Please note that the Newsela dataset is different from the NELA datasets, which are collections of news articles for the study of media bias and other applications.
GoEmotions is a corpus of 58k carefully curated comments extracted from Reddit, with human annotations to 27 emotion categories or Neutral.
104 PAPERS • 1 BENCHMARK
PubLayNet is a dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated.
The SummEval dataset is a resource developed by the Yale LILY Lab and Salesforce Research for evaluating text summarization models. It was created as part of a project to address shortcomings in summarization evaluation methods.
104 PAPERS • NO BENCHMARKS YET
We present ASDiv (Academia Sinica Diverse MWP Dataset), a diverse (in terms of both language patterns and problem types) English math word problem (MWP) corpus for evaluating the capability of various MWP solvers. Existing MWP corpora for studying AI progress remain limited either in language usage patterns or in problem types. We thus present a new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem types taught in elementary school. Each MWP is annotated with its problem type and grade level (for indicating the level of difficulty). Furthermore, we propose a metric to measure the lexicon usage diversity of a given MWP corpus, and demonstrate that ASDiv is more diverse than existing corpora. Experiments show that our proposed corpus reflects the true capability of MWP solvers more faithfully.
103 PAPERS • 1 BENCHMARK
A hand-object interaction dataset with 3D pose annotations of hand and object. The dataset contains 66,034 training images and 11,524 test images from a total of 68 sequences. The sequences are captured in multi-camera and single-camera setups and contain 10 different subjects manipulating 10 different objects from the YCB dataset. The annotations are automatically obtained using an optimization algorithm. The hand pose annotations for the test set are withheld and the accuracy of the algorithms on the test set can be evaluated with standard metrics using the CodaLab challenge submission (see project page). The object pose annotations for the test and train set are provided along with the dataset.
103 PAPERS • 2 BENCHMARKS
PTC is a collection of 344 chemical compounds represented as graphs, annotated with carcinogenicity in rats. Each node is assigned one of 19 node labels.
The Penn Action Dataset contains 2326 video sequences of 15 different actions and human joint annotations for each sequence.
102 PAPERS • 4 BENCHMARKS
SegTrack v2 is a video segmentation dataset with full pixel-level annotations on multiple objects at each frame within each video.
The Stylized-ImageNet dataset is created by removing local texture cues in ImageNet while retaining global shape information on natural images via AdaIN style transfer. This nudges CNNs towards learning more about shapes and less about local textures.
102 PAPERS • 1 BENCHMARK
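The AdaIN operation behind this stylization replaces the per-channel mean and standard deviation of content features with those of style features. A minimal NumPy sketch of that statistic-matching core, under the assumption of simple (channels, height, width) arrays; the real Stylized-ImageNet pipeline applies AdaIN to VGG encoder features inside a pretrained encoder-decoder, and `adain` here is an illustrative name:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization: re-style content features by
    matching their per-channel mean and std to the style features.
    Both inputs have shape (channels, height, width)."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # Normalize away the content statistics, then apply the style ones.
    return s_std * (content - c_mean) / c_std + s_mean
```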
A repository that contains political events with a specific timestamp. These political events relate entities (e.g., countries, presidents) to a number of other entities via logical predicates (e.g., "Make a visit" or "Express intent to meet or negotiate").
101 PAPERS • 2 BENCHMARKS
CORNELL NEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017 and use a variety of summarization strategies combining extraction and abstraction.
101 PAPERS • NO BENCHMARKS YET
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) is a large-scale reading comprehension dataset which requires commonsense reasoning. ReCoRD consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news. The goal of ReCoRD is to evaluate a machine's ability of commonsense reasoning in reading comprehension. ReCoRD is pronounced as [ˈrɛkərd].
101 PAPERS • 1 BENCHMARK
Sentiment analysis is increasingly viewed as a vital task both from an academic and a commercial standpoint. The majority of current approaches, however, attempt to detect the overall polarity of a sentence, paragraph, or text span, regardless of the entities mentioned (e.g., laptops, restaurants) and their aspects (e.g., battery, screen; food, service). By contrast, this task is concerned with aspect based sentiment analysis (ABSA), where the goal is to identify the aspects of given target entities and the sentiment expressed towards each aspect. Datasets consisting of customer reviews with human-authored annotations identifying the mentioned aspects of the target entities and the sentiment polarity of each aspect will be provided.
101 PAPERS • 5 BENCHMARKS
The TinyImages dataset contains 80 million 32×32 images collected from the Internet by using the words in WordNet as search queries.
The Visual Storytelling Dataset (VIST) consists of 210,819 unique photos and 50,000 stories. The images were collected from albums on Flickr. The albums include 10 to 50 images, and all the images in an album were taken within a 48-hour span. The stories were created by workers on Amazon Mechanical Turk, who were instructed to choose five images from an album and write a story about them. Every story has five sentences, and every sentence is paired with its corresponding image. The dataset is split into a training set (80%), a validation set (10%) and a test set (10%). All words and punctuation marks in the stories are separated by a space character, all location names are replaced with the word "location", and all names of people are replaced with the word "male" or "female" depending on the gender of the person.
CARER is an emotion dataset collected through noisy labels, annotated via distant supervision as in (Go et al., 2009).
100 PAPERS • 4 BENCHMARKS
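Distant supervision here means labels come from noisy surface signals rather than human annotators; Go et al. (2009) used emoticons in this way, and an emotion variant can use hashtags. A hedged sketch with a hypothetical tag-to-emotion map — the actual CARER label set and rules follow the paper, not this code:

```python
# Hypothetical hashtag-to-emotion map for illustration only.
EMOTION_TAGS = {"#joy": "joy", "#happy": "joy", "#sad": "sadness",
                "#angry": "anger", "#fear": "fear"}

def distant_label(tweet):
    """Distant supervision: assign a noisy emotion label from a hashtag
    occurring in the text, and strip the hashtag so the model cannot
    simply read the label back out. Returns (text, label) or None."""
    lowered = tweet.lower()
    for tag, emotion in EMOTION_TAGS.items():
        if tag in lowered:
            return lowered.replace(tag, "").strip(), emotion
    return None  # no labeling signal found
```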
The Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset is one of the largest publicly available datasets for lip reading sentences in-the-wild. The database consists of mainly news and talk shows from BBC programs. Each sentence is up to 100 characters in length. The training, validation and test sets are divided according to broadcast date. It is a challenging set since it contains thousands of speakers without speaker labels and large variation in head pose. The pre-training set contains 96,318 utterances, the training set contains 45,839 utterances, the validation set contains 1,082 utterances and the test set contains 1,242 utterances.
100 PAPERS • 9 BENCHMARKS
QASC is a question-answering dataset with a focus on sentence composition. It consists of 9,980 8-way multiple-choice questions about grade school science (8,134 train, 926 dev, 920 test), and comes with a corpus of 17M sentences.
100 PAPERS • NO BENCHMARKS YET
The RVL-CDIP dataset consists of scanned document images belonging to 16 classes such as letter, form, email, resume, memo, etc. The dataset has 320,000 training, 40,000 validation and 40,000 test images. The images are characterized by low quality, noise, and low resolution, typically 100 dpi.
100 PAPERS • 3 BENCHMARKS
The Wider Facial Landmarks in the Wild (WFLW) database contains 10,000 faces (7,500 for training and 2,500 for testing) with 98 annotated landmarks. This database also features rich attribute annotations in terms of occlusion, head pose, make-up, illumination, blur and expressions.
This corpus comprises monolingual data for 100+ languages and also includes data for romanized languages. It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing January–December 2018 Commoncrawl snapshots. Each file comprises documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository.
99 PAPERS • NO BENCHMARKS YET
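The stated file layout (documents separated by blank lines, paragraphs separated by single newlines) can be parsed with a few lines of Python. `parse_cc100` is an illustrative helper, not part of the CC-Net tooling:

```python
def parse_cc100(text):
    """Split CC-100-style text into a list of documents, each a list
    of paragraph strings. Documents are separated by double newlines;
    paragraphs within a document by a single newline."""
    docs = []
    for block in text.strip().split("\n\n"):
        paragraphs = [p for p in block.split("\n") if p]
        if paragraphs:
            docs.append(paragraphs)
    return docs
```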
Cholec80 is an endoscopic video dataset containing 80 videos of cholecystectomy surgeries performed by 13 surgeons. The videos are captured at 25 fps and downsampled to 1 fps for processing. The whole dataset is labeled with the phase and tool presence annotations. The phases have been defined by a senior surgeon in Strasbourg hospital, France. Since the tools are sometimes hardly visible in the images and thus difficult to be recognized visually, a tool is defined as present in an image if at least half of the tool tip is visible.
99 PAPERS • 2 BENCHMARKS
Visual7W is a large-scale visual question answering (QA) dataset, with object-level groundings and multimodal answers. Each question starts with one of the seven Ws, what, where, when, who, why, how and which. It is collected from 47,300 COCO images and it has 327,929 QA pairs, together with 1,311,756 human-generated multiple-choices and 561,459 object groundings from 36,579 categories.
99 PAPERS • 1 BENCHMARK
The Yelp2018 dataset is adopted from the 2018 edition of the Yelp Challenge, in which local businesses such as restaurants and bars are viewed as items. The standard 10-core setting is applied to ensure data quality.
99 PAPERS • 3 BENCHMARKS
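The 10-core setting removes users and items with fewer than 10 interactions, repeating the filter until every remaining user and item meets the threshold (dropping one side can push the other below it). A minimal sketch, with the function name and (user, item) tuple format as illustrative assumptions:

```python
from collections import Counter

def k_core_filter(interactions, k=10):
    """Iteratively drop users and items with fewer than k interactions
    until all remaining users and items have at least k (the 'k-core'
    preprocessing common in recommendation datasets)."""
    interactions = list(interactions)
    while True:
        users = Counter(u for u, _ in interactions)
        items = Counter(i for _, i in interactions)
        kept = [(u, i) for u, i in interactions
                if users[u] >= k and items[i] >= k]
        if len(kept) == len(interactions):  # fixed point reached
            return kept
        interactions = kept
```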
AVA is a project that provides audiovisual annotations of video for improving our understanding of human activity. Each of the video clips has been exhaustively annotated by human annotators, and together they represent a rich variety of scenes, recording conditions, and expressions of human activity. There are annotations for:
98 PAPERS • 7 BENCHMARKS
The data for FRGC consists of 50,000 recordings divided into training and validation partitions. The training partition is designed for training algorithms, and the validation partition is for assessing performance of an approach in a laboratory setting. The validation partition consists of data from 4,003 subject sessions. A subject session is the set of all images of a person taken each time a person's biometric data is collected, and consists of four controlled still images, two uncontrolled still images, and one three-dimensional image. The controlled images were taken in a studio setting and are full frontal facial images captured under two lighting conditions and with two facial expressions (smiling and neutral). The uncontrolled images were taken in varying illumination conditions, e.g., hallways, atriums, or outside. Each set of uncontrolled images contains two expressions, smiling and neutral. The 3D images were taken under controlled illumination conditions and consist of both range and texture data.
98 PAPERS • 1 BENCHMARK
MathQA significantly enhances the AQuA dataset with fully-specified operational programs.
The NCLT dataset is a large scale, long-term autonomy dataset for robotics research collected on the University of Michigan’s North Campus. The dataset consists of omnidirectional imagery, 3D lidar, planar lidar, GPS, and proprioceptive sensors for odometry collected using a Segway robot. The dataset was collected to facilitate research focusing on long-term autonomous operation in changing environments. The dataset is comprised of 27 sessions spaced approximately biweekly over the course of 15 months. The sessions repeatedly explore the campus, both indoors and outdoors, on varying trajectories, and at different times of the day across all four seasons. This allows the dataset to capture many challenging elements including: moving obstacles (e.g., pedestrians, bicyclists, and cars), changing lighting, varying viewpoint, seasonal and weather changes (e.g., falling leaves and snow), and long-term structural changes caused by construction projects.
98 PAPERS • NO BENCHMARKS YET
SA-1B consists of 11M diverse, high-resolution, licensed, and privacy-protecting images and 1.1B high-quality segmentation masks.
VATEX is a multilingual, large, linguistically complex, and diverse dataset in terms of both video and natural language descriptions. It has two tasks for video-and-language research: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context.
98 PAPERS • 3 BENCHMARKS
The highD dataset is a new dataset of naturalistic vehicle trajectories recorded on German highways. The aerial perspective of the recording drone overcomes typical limitations of established traffic data collection methods, such as occlusions. Traffic was recorded at six different locations and includes more than 110,500 vehicles. Each vehicle's trajectory, including vehicle type, size and manoeuvres, is automatically extracted. Using state-of-the-art computer vision algorithms, the positioning error is typically less than ten centimeters. Although the dataset was created for the safety validation of highly automated vehicles, it is also suitable for many other tasks such as the analysis of traffic patterns or the parameterization of driver models.
CommonGen is constructed through a combination of crowdsourcing and existing caption corpora, and consists of 79k commonsense descriptions over 35k unique concept-sets.
97 PAPERS • 1 BENCHMARK
The Free Music Archive (FMA) is a large-scale dataset for evaluating several tasks in Music Information Retrieval. It consists of 343 days of audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies.
97 PAPERS • 2 BENCHMARKS
Grammarly's Yahoo Answers Formality Corpus (GYAFC) is the largest dataset for any style, containing a total of 110K informal/formal sentence pairs.
97 PAPERS • 3 BENCHMARKS
Imagenet32 is a down-sampled version of ImageNet consisting of small 32×32 images. It comprises 1,281,167 training images and 50,000 test images with 1,000 labels.
97 PAPERS • 4 BENCHMARKS
The Soil Moisture Active Passive (SMAP) dataset consists of soil and telemetry data from NASA's SMAP satellite. It was originally published in https://arxiv.org/abs/1802.04431 and used for the unsupervised anomaly detection task in time series data. It has since been used in many popular anomaly detection methods and benchmarks that distribute it in their repositories, e.g., https://github.com/OpsPAI/MTAD.
WritingPrompts is a large dataset of 300K human-written stories paired with writing prompts from an online forum.
The BP4D-Spontaneous dataset is a 3D video database of spontaneous facial expressions in a diverse group of young adults. Well-validated emotion inductions were used to elicit expressions of emotion and paralinguistic communication. Frame-level ground truth for facial actions was obtained using the Facial Action Coding System. Facial features were tracked in both 2D and 3D domains using both person-specific and generic approaches. The database includes forty-one participants (23 women, 18 men). They were 18–29 years of age; 11 were Asian, 6 were African-American, 4 were Hispanic, and 20 were Euro-American. An emotion elicitation protocol was designed to elicit emotions of participants effectively: eight tasks, covering an interview process and a series of activities, were used to elicit eight emotions. The database is structured by participant, with each participant associated with 8 tasks. For each task, there are both 3D and 2D videos. In addition, the metadata include manually annotated facial action units.
96 PAPERS • 3 BENCHMARKS