The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
10,363 PAPERS • 93 BENCHMARKS
MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies were performed at Beth Israel Deaconess Medical Center in Boston, MA.
178 PAPERS • 2 BENCHMARKS
ECHR is an English legal judgment prediction dataset of cases from the European Court of Human Rights (ECHR). The dataset contains ~11.5k cases, including the raw text.
39 PAPERS • 1 BENCHMARK
LSHTC is a dataset for large-scale text classification. The data used in the LSHTC challenges originates from two popular sources: the DBpedia and the ODP (Open Directory Project) directory, also known as DMOZ. DBpedia instances were selected from the english, non-regional Extended Abstracts provided by the DBpedia site. The DMOZ instances consist of either Content vectors, Description vectors or both. A Content vectors is obtained by directly indexing the web page using standard indexing chain (preprocessing, stemming/lemmatization, stop-word removal).
18 PAPERS • NO BENCHMARKS YET
OpenImages V6 is a large-scale dataset , consists of 9 million training images, 41,620 validation samples, and 125,456 test samples. It is a partially annotated dataset, with 9,600 trainable classes
18 PAPERS • 3 BENCHMARKS
For each dataset we provide a short description as well as some characterization metrics. It includes the number of instances (m), number of attributes (d), number of labels (q), cardinality (Card), density (Dens), diversity (Div), average Imbalance Ratio per label (avgIR), ratio of unconditionally dependent label pairs by chi-square test (rDep) and complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is defined as cardinality divided by the number of labels. Diversity represents the percentage of labelsets present in the dataset divided by the number of possible labelsets. The avgIR measures the average degree of imbalance of all labels, the greater avgIR, the greater the imbalance of the dataset. Finally, rDep measures the proportion of pairs of labels that are dependent at 99% confidence. A broader description of all the characterization metrics and the used partition methods are described in
4 PAPERS • NO BENCHMARKS YET
Sewer-ML is a sewer defect dataset. It contains 1.3 million images, from 75,618 videos collected from three Danish water utility companies over nine years. All videos have been annotated by licensed sewer inspectors following the Danish sewer inspection standard, Fotomanualen. This leads to consistent and reliable annotations, and a total of 17 annotated defect classes.
CAVES is the first large-scale dataset containing about 10k COVID-19 anti-vaccine tweets labelled into various specific anti-vaccine concerns in a multi-label setting. This is also the first multi-label classification dataset that provides explanations for each of the labels. Additionally, the dataset also provides class-wise summaries of all the tweets.
1 PAPER • NO BENCHMARKS YET