CoNLL-2003 is a named entity recognition dataset released as a part of CoNLL-2003 shared task: language-independent named entity recognition. The data consists of eight files covering two languages: English and German. For each of the languages there is a training file, a development file, a test file and a large file with unannotated data.
648 PAPERS • 16 BENCHMARKS
OntoNotes 5.0 is a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
240 PAPERS • 11 BENCHMARKS
BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions.
176 PAPERS • 7 BENCHMARKS
The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts.
160 PAPERS • 2 BENCHMARKS
Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking.
147 PAPERS • 3 BENCHMARKS
The NCBI Disease corpus consists of 793 PubMed abstracts, which are separated into training (593), development (100) and test (100) subsets. The NCBI Disease corpus is annotated with disease mentions, using concept identifiers from either MeSH or OMIM.
142 PAPERS • 3 BENCHMARKS
The BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora. These tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedicine text-mining challenges.
125 PAPERS • NO BENCHMARKS YET
SciERC dataset is a collection of 500 scientific abstract annotated with scientific entities, their relations, and coreference clusters. The abstracts are taken from 12 AI conference/workshop proceedings in four AI communities, from the Semantic Scholar Corpus. SciERC extends previous datasets in scientific articles SemEval 2017 Task 10 and SemEval 2018 Task 7 by extending entity types, relation types, relation coverage, and adding cross-sentence relations using coreference links.
123 PAPERS • 7 BENCHMARKS
The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology.
116 PAPERS • 6 BENCHMARKS
This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarisation), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet “so.. kktny in 30 mins?” - even human experts find entity kktny hard to detect and resolve. This task will evaluate the ability to detect and classify novel, emerging, singleton named entities in noisy text.
114 PAPERS • 2 BENCHMARKS
Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built, one is supervised (Few-NERD (SUP)) and the other two are few-shot (Few-NERD (INTRA) and Few-NERD (INTER)).
72 PAPERS • 3 BENCHMARKS
The shared task of CoNLL-2002 concerns language-independent named entity recognition. The types of named entities include: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The participants of the shared task were offered training and test data for at least two languages. Information sources other than the training data might have been used in this shared task.
70 PAPERS • 3 BENCHMARKS
ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.
62 PAPERS • 9 BENCHMARKS
MasakhaNER is a collection of Named Entity Recognition (NER) datasets for 10 different African languages. The languages forming this dataset are: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, and Yorùbá.
50 PAPERS • 2 BENCHMARKS
ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations and was created by Linguistic Data Consortium with support from the ACE Program, with additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.
47 PAPERS • 5 BENCHMARKS
RadGraph is a dataset of entities and relations in radiology reports based on our novel information extraction schema, consisting of 600 reports with 30K radiologist annotations and 221K reports with 10.5M automatically generated annotations.
46 PAPERS • NO BENCHMARKS YET
MedMentions is a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines.
45 PAPERS • 1 BENCHMARK
MultiCoNER is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions.
44 PAPERS • NO BENCHMARKS YET
This data is for the task of named entity recognition and linking/disambiguation over tweets. It comprises the addition of an entity URI layer on top of an NER-annotated tweet dataset. The task is to detect entities and then provide a correct link to them in DBpedia, thus disambiguating otherwise ambiguous entity surface forms; for example, this means linking "Paris" to the correct instance of a city named that (e.g. Paris, France vs. Paris, Texas).
31 PAPERS • 1 BENCHMARK
Korean Language Understanding Evaluation (KLUE) benchmark is a series of datasets to evaluate natural language understanding capability of Korean language models. KLUE consists of 8 diverse and representative tasks, which are accessible to anyone without any restrictions. With ethical considerations in mind, we deliberately design annotation guidelines to obtain unambiguous annotations for all datasets. Furthermore, we build an evaluation system and carefully choose evaluations metrics for every task, thus establishing fair comparison across Korean language models.
19 PAPERS • 1 BENCHMARK
JNLPBA is a biomedical dataset that comes from the GENIA version 3.02 corpus (Kim et al., 2003). It was created with a controlled search on MEDLINE. From this search 2,000 abstracts were selected and hand annotated according to a small taxonomy of 48 classes based on a chemical classification. 36 terminal classes were used to annotate the GENIA corpus.
18 PAPERS • 2 BENCHMARKS
The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation sub-tasks: (i) Named Entity Recognition (NER), (ii) Coreference Resolution, (iii) Relation Extraction (RE), and (iv) Entity Linking. DWIE is conceived as an entity-centric dataset that describes interactions and properties of conceptual entities on the level of the complete document.
16 PAPERS • 4 BENCHMARKS
BioRED is a first-of-its-kind biomedical relation extraction dataset with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene–disease; chemical–chemical) at the document level, on a set of600 PubMed abstracts. Furthermore, BioRED label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information.
15 PAPERS • 3 BENCHMARKS
CrossNER is a cross-domain NER (Named Entity Recognition) dataset, a fully-labeled collection of NER data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains. Additionally, CrossNER also includes unlabeled domain-related corpora for the corresponding five domains.
12 PAPERS • 1 BENCHMARK
This paper introduces the Broad Twitter Corpus (BTC), which is not only significantly bigger, but sampled across different regions, temporal periods, and types of Twitter users. The gold-standard named entity annotations are made by a combination of NLP experts and crowd workers, which enables us to harness crowd recall while maintaining high quality. We also measure the entity drift observed in our dataset (i.e. how entity representation varies over time), and compare to newswire.
11 PAPERS • 2 BENCHMARKS
Created by Smith et al. at 2008, the BioCreative II Gene Mention Recognition (BC2GM) Dataset contains data where participants are asked to identify a gene mention in a sentence by giving its start and end characters. The training set consists of a set of sentences, and for each sentence a set of gene mentions (GENE annotations). [registration required for access], in English language. Containing 20 in n/a file format.
10 PAPERS • 1 BENCHMARK
Chinese Medical Named Entity Recognition, a dataset first released in CHIP20204, is used for CMeEE task. Given a pre-defined schema, the task is to identify and extract entities from the given sentence and classify them into nine categories: disease, clinical manifestations, drugs, medical equipment, medical procedures, body, medical examinations, microorganisms, and department.
9 PAPERS • 2 BENCHMARKS
The Bacteria Biotope (BB) Task is part of the BioNLP Open Shared Tasks and meets the BioNLP-OST standards of quality, originality and data formats. Manually annotated data is provided for training, development and evaluation of information extraction methods. Tools for the detailed evaluation of system outputs are available. Support in performing linguistic processing are provided in the form of analyses created by various state-of-the art tools on the dataset texts.
8 PAPERS • 2 BENCHMARKS
CoNLL-2000 is a dataset for dividing text into syntactically related non-overlapping groups of words, so-called text chunking.
GUM is an open source multilayer English corpus of richly annotated texts from twelve text types. Annotations include:
8 PAPERS • 1 BENCHMARK
Peyma is a Persian NER dataset to train and test NER systems. It is constructed by collecting documents from ten news websites.
8 PAPERS • NO BENCHMARKS YET
GeoWebNews provides test/train examples and enable fine-grained Geotagging and Toponym Resolution (Geocoding). This dataset is also suitable for prototyping and evaluating machine learning NLP models.
7 PAPERS • NO BENCHMARKS YET
The MEDIA French corpus is dedicated to semantic extraction from speech in a context of human/machine dialogues. The corpus has manual transcription and conceptual annotation of dialogues from 250 speakers. It is split into the following three parts : (1) the training set (720 dialogues, 12K sentences), (2) the development set (79 dialogues, 1.3K sentences, and (3) the test set (200 dialogues, 3K sentences).
6 PAPERS • NO BENCHMARKS YET
legal_NER is a corpus of 46545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in indian court judgement.
AMALGUM is a machine annotated multilayer corpus following the same design and annotation layers as GUM, but substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high quality, richly annotated, but small datasets, and the larger but shallowly annotated corpora that are often scraped from the Web.
5 PAPERS • NO BENCHMARKS YET
Danish Dependency Treebank (DaNE) is a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme.
5 PAPERS • 5 BENCHMARKS
Species-800 is a corpus for species entities, which is based on manually annotated abstracts. It comprises 800 PubMed abstracts that contain identified organism mentions. To increase the corpus taxonomic mention diversity the 800 abstracts were collected by selecting 100 abstracts from the following 8 categories: bacteriology, botany, entomology, medicine, mycology, protistology, virology and zoology. 800 has been annotated with a focus at the species level; however, higher taxa mentions (such as genera, families and orders) have also been considered.
5 PAPERS • 1 BENCHMARK
WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.
Introduced by Krallinger et al. in The CHEMDNER corpus of chemicals and drugs and its annotation principles
4 PAPERS • 1 BENCHMARK
DaN+ is a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization to support research on cross-lingual cross-domain learning for a less-resourced language.
4 PAPERS • NO BENCHMARKS YET
Finnish News Corpus for Named Entity Recognition (Finer) is a corpus that consists of 953 articles (193,742 word tokens) with six named entity classes (organization, location, person, product, event,and date). The articles are extracted from the archives of Digitoday, a Finnish online technology news source.
LeNER-Br is a dataset for named entity recognition (NER) in Brazilian Legal Text.
4 PAPERS • 2 BENCHMARKS
Naamapadam is a Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. In each language, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence.
Romanian Named Entity Corpus is a named entity corpus for the Romanian language. The corpus contains over 26000 entities in ~5000 annotated sentences, belonging to 16 distinct classes. The sentences have been extracted from a copy-right free newspaper, covering several styles. This corpus represents the first initiative in the Romanian language space specifically targeted for named entity recognition.
This dataset contains 1304 de-identified longitudinal medical records describing 296 patients.
We introduce FUNSD-r and CORD-r in Token Path Prediction, the revised VrD-NER datasets to reflect the real-world scenarios of NER on scanned VrDs.
3 PAPERS • 1 BENCHMARK
COVID-Q consists of COVID-19 questions which have been annotated into a broad category (e.g. Transmission, Prevention) and a more specific class such that questions in the same class are all asking the same thing.
3 PAPERS • NO BENCHMARKS YET
DR.BENCH is a dataset for developing and evaluating cNLP models with clinical diagnostic reasoning ability. The suite includes six tasks from ten publicly available datasets addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation.