🔔 Share your dataset with the ML community!

Filter by Modality (clear)

Filter by Task (clear)

Filter by Language

93 dataset results for Named Entity Recognition (NER) AND Texts

CoNLL 2003

CoNLL-2003 is a named entity recognition dataset released as a part of CoNLL-2003 shared task: language-independent named entity recognition. The data consists of eight files covering two languages: English and German. For each of the languages there is a training file, a development file, a test file and a large file with unannotated data.

648 PAPERS • 16 BENCHMARKS

OntoNotes 5.0

OntoNotes 5.0 is a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).

240 PAPERS • 11 BENCHMARKS

BC5CDR (BioCreative V CDR corpus)

BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases and 3116 chemical-disease interactions.

176 PAPERS • 7 BENCHMARKS

PubMedQA

The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts.

160 PAPERS • 2 BENCHMARKS

FUNSD (Form Understanding in Noisy Scanned Documents)

Form Understanding in Noisy Scanned Documents (FUNSD) comprises 199 real, fully annotated, scanned forms. The documents are noisy and vary widely in appearance, making form understanding (FoUn) a challenging task. The proposed dataset can be used for various tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking.

147 PAPERS • 3 BENCHMARKS

NCBI Disease

The NCBI Disease corpus consists of 793 PubMed abstracts, which are separated into training (593), development (100) and test (100) subsets. The NCBI Disease corpus is annotated with disease mentions, using concept identifiers from either MeSH or OMIM.

142 PAPERS • 3 BENCHMARKS

BLUE

BLUE (Biomedical Language Understanding Evaluation)

The BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora. These tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedicine text-mining challenges.

125 PAPERS • NO BENCHMARKS YET

SciERC

SciERC dataset is a collection of 500 scientific abstract annotated with scientific entities, their relations, and coreference clusters. The abstracts are taken from 12 AI conference/workshop proceedings in four AI communities, from the Semantic Scholar Corpus. SciERC extends previous datasets in scientific articles SemEval 2017 Task 10 and SemEval 2018 Task 7 by extending entity types, relation types, relation coverage, and adding cross-sentence relations using coreference links.

123 PAPERS • 7 BENCHMARKS

GENIA

The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology.

116 PAPERS • 6 BENCHMARKS

WNUT 2017

WNUT 2017 (WNUT 2017 Emerging and Rare entity recognition)

This shared task focuses on identifying unusual, previously-unseen entities in the context of emerging discussions. Named entities form the basis of many modern approaches to other tasks (like event clustering and summarisation), but recall on them is a real problem in noisy text - even among annotators. This drop tends to be due to novel entities and surface forms. Take for example the tweet “so.. kktny in 30 mins?” - even human experts find entity kktny hard to detect and resolve. This task will evaluate the ability to detect and classify novel, emerging, singleton named entities in noisy text.

114 PAPERS • 2 BENCHMARKS

Few-NERD

Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built, one is supervised (Few-NERD (SUP)) and the other two are few-shot (Few-NERD (INTRA) and Few-NERD (INTER)).

72 PAPERS • 3 BENCHMARKS

CoNLL 2002

The shared task of CoNLL-2002 concerns language-independent named entity recognition. The types of named entities include: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. The participants of the shared task were offered training and test data for at least two languages. Information sources other than the training data might have been used in this shared task.

70 PAPERS • 3 BENCHMARKS

ACE 2005 (ACE 2005 Multilingual Training Corpus)

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.

62 PAPERS • 9 BENCHMARKS

MasakhaNER

MasakhaNER is a collection of Named Entity Recognition (NER) datasets for 10 different African languages. The languages forming this dataset are: Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian-Pidgin, Swahili, Wolof, and Yorùbá.

50 PAPERS • 2 BENCHMARKS

ACE 2004

ACE 2004 (ACE 2004 Multilingual Training Corpus)

ACE 2004 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations and was created by Linguistic Data Consortium with support from the ACE Program, with additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.

47 PAPERS • 5 BENCHMARKS

RadGraph

RadGraph (RadGraph: Extracting Clinical Entities and Relations from Radiology Reports)

RadGraph is a dataset of entities and relations in radiology reports based on our novel information extraction schema, consisting of 600 reports with 30K radiologist annotations and 221K reports with 10.5M automatically generated annotations.

46 PAPERS • NO BENCHMARKS YET

MedMentions

MedMentions is a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines.

45 PAPERS • 1 BENCHMARK

MultiCoNER

MultiCoNER is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions.

44 PAPERS • NO BENCHMARKS YET

IPM NEL

IPM NEL (Derczynski IPM Named Entity Linking)

This data is for the task of named entity recognition and linking/disambiguation over tweets. It comprises the addition of an entity URI layer on top of an NER-annotated tweet dataset. The task is to detect entities and then provide a correct link to them in DBpedia, thus disambiguating otherwise ambiguous entity surface forms; for example, this means linking "Paris" to the correct instance of a city named that (e.g. Paris, France vs. Paris, Texas).

31 PAPERS • 1 BENCHMARK

KLUE (Korean Language Understanding Evaluation)

Korean Language Understanding Evaluation (KLUE) benchmark is a series of datasets to evaluate natural language understanding capability of Korean language models. KLUE consists of 8 diverse and representative tasks, which are accessible to anyone without any restrictions. With ethical considerations in mind, we deliberately design annotation guidelines to obtain unambiguous annotations for all datasets. Furthermore, we build an evaluation system and carefully choose evaluations metrics for every task, thus establishing fair comparison across Korean language models.

19 PAPERS • 1 BENCHMARK

JNLPBA

JNLPBA is a biomedical dataset that comes from the GENIA version 3.02 corpus (Kim et al., 2003). It was created with a controlled search on MEDLINE. From this search 2,000 abstracts were selected and hand annotated according to a small taxonomy of 48 classes based on a chemical classification. 36 terminal classes were used to annotate the GENIA corpus.

18 PAPERS • 2 BENCHMARKS

DWIE

DWIE (Deutsche Welle corpus for Information Extraction)

The 'Deutsche Welle corpus for Information Extraction' (DWIE) is a multi-task dataset that combines four main Information Extraction (IE) annotation sub-tasks: (i) Named Entity Recognition (NER), (ii) Coreference Resolution, (iii) Relation Extraction (RE), and (iv) Entity Linking. DWIE is conceived as an entity-centric dataset that describes interactions and properties of conceptual entities on the level of the complete document.

16 PAPERS • 4 BENCHMARKS

BioRED

BioRED is a first-of-its-kind biomedical relation extraction dataset with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene–disease; chemical–chemical) at the document level, on a set of600 PubMed abstracts. Furthermore, BioRED label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information.

15 PAPERS • 3 BENCHMARKS

CrossNER

CrossNER is a cross-domain NER (Named Entity Recognition) dataset, a fully-labeled collection of NER data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains. Additionally, CrossNER also includes unlabeled domain-related corpora for the corresponding five domains.

12 PAPERS • 1 BENCHMARK

Broad Twitter Corpus

This paper introduces the Broad Twitter Corpus (BTC), which is not only significantly bigger, but sampled across different regions, temporal periods, and types of Twitter users. The gold-standard named entity annotations are made by a combination of NLP experts and crowd workers, which enables us to harness crowd recall while maintaining high quality. We also measure the entity drift observed in our dataset (i.e. how entity representation varies over time), and compare to newswire.

11 PAPERS • 2 BENCHMARKS

BC2GM

Created by Smith et al. at 2008, the BioCreative II Gene Mention Recognition (BC2GM) Dataset contains data where participants are asked to identify a gene mention in a sentence by giving its start and end characters. The training set consists of a set of sentences, and for each sentence a set of gene mentions (GENE annotations). [registration required for access], in English language. Containing 20 in n/a file format.

10 PAPERS • 1 BENCHMARK

CMeEE

CMeEE (Chinese Medical Named Entity Recognition Dataset)

Chinese Medical Named Entity Recognition, a dataset first released in CHIP20204, is used for CMeEE task. Given a pre-defined schema, the task is to identify and extract entities from the given sentence and classify them into nine categories: disease, clinical manifestations, drugs, medical equipment, medical procedures, body, medical examinations, microorganisms, and department.

9 PAPERS • 2 BENCHMARKS

BB (Bacteria Biotope)

The Bacteria Biotope (BB) Task is part of the BioNLP Open Shared Tasks and meets the BioNLP-OST standards of quality, originality and data formats. Manually annotated data is provided for training, development and evaluation of information extraction methods. Tools for the detailed evaluation of system outputs are available. Support in performing linguistic processing are provided in the form of analyses created by various state-of-the art tools on the dataset texts.

8 PAPERS • 2 BENCHMARKS

CoNLL-2000

CoNLL-2000 is a dataset for dividing text into syntactically related non-overlapping groups of words, so-called text chunking.

8 PAPERS • 2 BENCHMARKS

GUM (Georgetown University Multilayer corpus)

GUM is an open source multilayer English corpus of richly annotated texts from twelve text types. Annotations include:

8 PAPERS • 1 BENCHMARK

PEYMA

Peyma is a Persian NER dataset to train and test NER systems. It is constructed by collecting documents from ten news websites.

8 PAPERS • NO BENCHMARKS YET

GeoWebNews

GeoWebNews provides test/train examples and enable fine-grained Geotagging and Toponym Resolution (Geocoding). This dataset is also suitable for prototyping and evaluating machine learning NLP models.

7 PAPERS • NO BENCHMARKS YET

MEDIA

The MEDIA French corpus is dedicated to semantic extraction from speech in a context of human/machine dialogues. The corpus has manual transcription and conceptual annotation of dialogues from 250 speakers. It is split into the following three parts : (1) the training set (720 dialogues, 12K sentences), (2) the development set (79 dialogues, 1.3K sentences, and (3) the test set (200 dialogues, 3K sentences).

6 PAPERS • NO BENCHMARKS YET

legal_NER

legal_NER is a corpus of 46545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in indian court judgement.

6 PAPERS • NO BENCHMARKS YET

AMALGUM

AMALGUM (A Machine Annotated Lookalike of GUM)

AMALGUM is a machine annotated multilayer corpus following the same design and annotation layers as GUM, but substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high quality, richly annotated, but small datasets, and the larger but shallowly annotated corpora that are often scraped from the Web.

5 PAPERS • NO BENCHMARKS YET

DaNE

DaNE (Danish Dependency Treebank)

Danish Dependency Treebank (DaNE) is a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme.

5 PAPERS • 5 BENCHMARKS

Species-800

Species-800 is a corpus for species entities, which is based on manually annotated abstracts. It comprises 800 PubMed abstracts that contain identified organism mentions. To increase the corpus taxonomic mention diversity the 800 abstracts were collected by selecting 100 abstracts from the following 8 categories: bacteriology, botany, entomology, medicine, mycology, protistology, virology and zoology. 800 has been annotated with a focus at the species level; however, higher taxa mentions (such as genera, families and orders) have also been considered.

5 PAPERS • 1 BENCHMARK

WikiNEuRal

WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.

5 PAPERS • NO BENCHMARKS YET

BC4CHEMD

BC4CHEMD (BioCreative IV Chemical compound and drug name recognition)

Introduced by Krallinger et al. in The CHEMDNER corpus of chemicals and drugs and its annotation principles

4 PAPERS • 1 BENCHMARK

DaN+

DaN+ is a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization to support research on cross-lingual cross-domain learning for a less-resourced language.

4 PAPERS • NO BENCHMARKS YET

Finer

Finer (Finnish News Corpus for Named Entity Recognition)

Finnish News Corpus for Named Entity Recognition (Finer) is a corpus that consists of 953 articles (193,742 word tokens) with six named entity classes (organization, location, person, product, event,and date). The articles are extracted from the archives of Digitoday, a Finnish online technology news source.

4 PAPERS • NO BENCHMARKS YET

LeNER-Br

LeNER-Br is a dataset for named entity recognition (NER) in Brazilian Legal Text.

4 PAPERS • 2 BENCHMARKS

Naamapadam

Naamapadam is a Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. In each language, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence.

4 PAPERS • NO BENCHMARKS YET

RONEC

RONEC (Romanian Named Entity Corpus)

Romanian Named Entity Corpus is a named entity corpus for the Romanian language. The corpus contains over 26000 entities in ~5000 annotated sentences, belonging to 16 distinct classes. The sentences have been extracted from a copy-right free newspaper, covering several styles. This corpus represents the first initiative in the Romanian language space specifically targeted for named entity recognition.

4 PAPERS • NO BENCHMARKS YET

i2b2 De-identification Dataset

i2b2 De-identification Dataset (Informatics for Integrating Biology and the Bedside (i2b2) Project — De-identification Dataset)

This dataset contains 1304 de-identified longitudinal medical records describing 296 patients.

4 PAPERS • 1 BENCHMARK

CORD-r

We introduce FUNSD-r and CORD-r in Token Path Prediction, the revised VrD-NER datasets to reflect the real-world scenarios of NER on scanned VrDs.

3 PAPERS • 1 BENCHMARK

COVID-Q

COVID-Q consists of COVID-19 questions which have been annotated into a broad category (e.g. Transmission, Prevention) and a more specific class such that questions in the same class are all asking the same thing.

3 PAPERS • NO BENCHMARKS YET

DR.BENCH (Diagnostic Reasoning Benchmark for clinical natural language processing)

DR.BENCH is a dataset for developing and evaluating cNLP models with clinical diagnostic reasoning ability. The suite includes six tasks from ten publicly available datasets addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation.

3 PAPERS • NO BENCHMARKS YET

Datasets

93 dataset results for Named Entity Recognition (NER) AND Texts