8 dataset results for Document Classification AND English

MPQA Opinion Corpus (Multi-Perspective Question Answering)

The MPQA Opinion Corpus contains 535 news articles from a wide variety of news sources manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.).

304 PAPERS • 3 BENCHMARKS

WOS (Web of Science Dataset)

Web of Science (WOS) is a document classification dataset that contains 46,985 documents across 134 categories, including 7 parent categories.

49 PAPERS • 4 BENCHMARKS

HOC (Hallmarks of Cancer)

The Hallmarks of Cancer (HOC) corpus consists of 1,852 PubMed publication abstracts manually annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of 37 classes arranged in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus.

28 PAPERS • 1 BENCHMARK

MultiEURLEX

MultiEURLEX is a multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated into 23 languages and annotated with multiple labels from the EUROVOC taxonomy. The 23 official EU languages covered span 7 language families.

10 PAPERS • NO BENCHMARKS YET

RTC (Reddit Time Corpus)

RTC is a benchmark corpus of social media comments sampled from the Pushshift Reddit dataset. It consists of 36.36 million unlabelled comments for adaptation and evaluation on an upstream masked language modelling task, and 0.9 million labelled comments for finetuning and evaluation on a downstream document classification task. The corpus covers three years, from March 2017 to February 2020, and is split into 36 evenly-sized monthly subsets based on comment timestamps.
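As a rough illustration of the monthly split described above, the following sketch maps a comment timestamp to one of the 36 monthly subsets between March 2017 and February 2020. The function names, the use of Unix timestamps, and the choice of UTC for month boundaries are assumptions made for this example; the listing does not say how the corpus authors performed the bucketing.

```python
from datetime import datetime, timezone
from typing import Optional

# Range stated in the dataset description: March 2017 through February 2020.
START_YEAR, START_MONTH = 2017, 3
NUM_MONTHS = 36


def month_index(unix_ts: float) -> Optional[int]:
    """Return the 0-based monthly subset index for a Unix timestamp.

    Assumption: month boundaries are taken in UTC.
    Returns None if the timestamp falls outside the 36-month range.
    """
    dt = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    idx = (dt.year - START_YEAR) * 12 + (dt.month - START_MONTH)
    return idx if 0 <= idx < NUM_MONTHS else None


def month_label(idx: int) -> str:
    """Return a 'YYYY-MM' label for a 0-based monthly subset index."""
    total = (START_MONTH - 1) + idx  # months elapsed since January of START_YEAR
    year = START_YEAR + total // 12
    month = total % 12 + 1
    return f"{year}-{month:02d}"
```

Here index 0 corresponds to March 2017 and index 35 to February 2020. The sketch covers only the timestamp-to-month assignment, not the sampling that makes the subsets evenly sized.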

2 PAPERS • NO BENCHMARKS YET

MeSHup (A Corpus for Full Text Biomedical Document Indexing)

MeSHup contains 1,342,667 full-text articles in English, together with the associated MeSH labels and metadata, authors, and publication venues collected from the MEDLINE database.

1 PAPER • NO BENCHMARKS YET

RVL-CDIP_MP (RVL-CDIP multi-page)

RVL-CDIP_MP is our first contribution toward retrieving the original documents of the IIT-CDIP test collection that were used to create RVL-CDIP. Some PDFs or encoded images were corrupt, which is why we have around 500 fewer instances. By leveraging metadata from OCR-IDL, we matched the original identifiers from IIT-CDIP and retrieved the documents from IDL using a conversion.

1 PAPER • NO BENCHMARKS YET

RVL-CDIP_N_MP (RVL-CDIP-N multi-page)

RVL-CDIP_N_MP can serve its original goal as a covariate shift test set, now for multi-page document classification. We were able to retrieve the original full documents from DocumentCloud and web search.

1 PAPER • NO BENCHMARKS YET
