Clotho is an audio captioning dataset, consisting of 4981 audio samples, and each audio sample has five captions (a total of 24 905 captions). Audio samples are of 15 to 30 s duration and captions are eight to 20 words long.
143 PAPERS • 5 BENCHMARKS
KP20k is a large-scale scholarly articles dataset with 528K articles for training, 20K articles for validation and 20K articles for testing.
79 PAPERS • 3 BENCHMARKS
CELEX database comprises three different searchable lexical databases, Dutch, English and German. The lexical data contained in each database is divided into five categories: orthography, phonology, morphology, syntax (word class) and word frequency.
57 PAPERS • NO BENCHMARKS YET
The Video-based Multimodal Summarization with Multimodal Output (VMSMO) corpus consists of 184,920 document-summary pairs, with 180,000 training pairs, 2,460 validation and test pairs. The task for this dataset is generating and appropriate textual summary of an article and choosing a proper cover frame from a video accompanying the article.
8 PAPERS • NO BENCHMARKS YET
SkillSpan is a dataset for Skill Extraction (SE). It is an important and widely-studied task useful to gain insights into labor market dynamics. However, there is a lacuna of datasets and annotation guidelines; available datasets are few and contain crowd-sourced labels on the span-level or labels from a predefined skill inventory. To address this gap, the authors introduce SkillSpan, a novel SE dataset consisting of 14.5K sentences and over 12.5K annotated spans.
7 PAPERS • NO BENCHMARKS YET
The CropAndWeed dataset is focused on the fine-grained identification of 74 relevant crop and weed species with a strong emphasis on data variability. Annotations of labeled bounding boxes, semantic masks and stem positions are provided for about 112k instances in more than 8k high-resolution images of both real-world agricultural sites and specifically cultivated outdoor plots of rare weed types. Additionally, each sample is enriched with meta-annotations regarding environmental conditions.
4 PAPERS • NO BENCHMARKS YET
PGDP5K is a dataset consisting of 5000 diagram samples composed of 16 shapes, covering 5 positional relations, 22 symbol types and 6 text types, labeled with more fine-grained annotations at primitive level, including primitive classes, locations and relationships, where 1,813 non-duplicated images are selected from the Geometry3K dataset and other 3,187 images are collected from three popular textbooks across grades 6-12 on mathematics curriculum websites by taking screenshots from PDF books.
4 PAPERS • 1 BENCHMARK
This is a dataset for segmentation and classification of epistemic activities in diagnostic reasoning texts.
3 PAPERS • NO BENCHMARKS YET
OSAI introduces OpenTTGames - an open dataset aimed at evaluation of different computer vision tasks in Table Tennis: ball detection, semantic segmentation of humans, table and scoreboard and fast in-game events spotting.
Huggingface Datasets is a great library, but it lacks standardization, and datasets require preprocessing work to be used interchangeably. tasksource automates this and facilitates reproducible multi-task learning scaling.
CQR is an extension to the Stanford Dialogue Corpus. It contains crowd-sourced rewrites to facilitate research in dialogue state tracking using natural language as the interface.
2 PAPERS • NO BENCHMARKS YET
ExHVV is a novel dataset that offers natural language explanations of connotative roles for three types of entities -- heroes, villains, and victims, encompassing 4,680 entities present in 3K memes.
A large scale, C2C marketplace e-commerce dataset.
1 PAPER • NO BENCHMARKS YET
RoboPianist is a benchmarking suite for high-dimensional control, targeted at testing high spatial and temporal precision, coordination, and planning, all with an underactuated system frequently making-and-breaking contacts. The proposed challenge is mastering the piano through bi-manual dexterity, using a pair of simulated anthropomorphic robot hands. The initial version covers a broad set of 150 variable-difficulty songs.
The availability of well-curated datasets has driven the success of Machine Learning (ML) models. Despite greater access to earth observation data in agriculture, there is a scarcity of curated and labelled datasets, which limits the potential of its use in training ML models for remote sensing (RS) in agriculture. To this end, we introduce a first-of-its-kind dataset called SICKLE, which constitutes a time-series of multi-resolution imagery from 3 distinct satellites: Landsat-8, Sentinel-1 and Sentinel-2. Our dataset constitutes multi-spectral, thermal and microwave sensors during January 2018 - March 2021 period. We construct each temporal sequence by considering the cropping practices followed by farmers primarily engaged in paddy cultivation in the Cauvery Delta region of Tamil Nadu, India; and annotate the corresponding imagery with key cropping parameters at multiple resolutions (i.e. 3m, 10m and 30m). Our dataset comprises 2, 370 season-wise samples from 388 unique plots, having
1 PAPER • 5 BENCHMARKS