WikiANN, also known as PAN-X, is a multilingual named entity recognition dataset. It consists of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organization) tags in the IOB2 format. It is used to train and evaluate named entity recognition models across many languages.
60 PAPERS • 3 BENCHMARKS
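The IOB2 scheme mentioned for WikiANN marks the first token of each entity with a `B-` prefix, every following token of the same entity with `I-`, and all other tokens with `O`. The snippet below is a minimal sketch of that convention and not WikiANN's own tooling; the function name `spans_to_iob2`, the span representation (token start, exclusive token end, label), and the example sentence are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical helper: a span is (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def spans_to_iob2(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping entity spans into IOB2 tags."""
    tags = ["O"] * num_tokens  # tokens outside any entity stay "O"
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # in IOB2 every entity begins with B-
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the entity get I-
    return tags


# Illustrative sentence, not taken from the dataset.
tokens = ["Angela", "Merkel", "visited", "New", "York", "."]
spans = [(0, 2, "PER"), (3, 5, "LOC")]
tags = spans_to_iob2(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because every entity starts with `B-`, two adjacent entities of the same type stay distinguishable, which is the main difference from the older IOB1 convention.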
XFUND is a multilingual form understanding benchmark dataset that includes human-labeled forms with key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).
15 PAPERS • NO BENCHMARKS YET
WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.
5 PAPERS • NO BENCHMARKS YET
LeNER-Br is a dataset for named entity recognition (NER) in Brazilian Legal Text.
4 PAPERS • 2 BENCHMARKS
HAREM is an evaluation initiative by Linguateca for named entity recognition in Portuguese. Its Golden Collection is a curated repository of annotated Portuguese texts that serves as a benchmark for evaluating how well systems recognize named entities in documents.
1 PAPER • NO BENCHMARKS YET
MiniHAREM was a repeat of the 2005 evaluation, using the same methodology and platform. Held from April 3rd to 5th, 2006, it gave participants a 48-hour window to annotate, verify, and submit text collections. The results, the collection used, the participant lists, the submitted outputs, and the updated guidelines are available. A HAREM format checker is also provided to verify compliance with the MiniHAREM directives. Information about the HAREM Meeting, which follows the Linguateca Summer School at the University of Porto and is open for registration until June 15th, is also available.
UNER v1 adds an NER annotation layer to 18 datasets (primarily treebanks from UD) and covers 12 genealogically and typologically diverse languages: Cebuano, Danish, German, English, Croatian, Portuguese, Russian, Slovak, Serbian, Swedish, Tagalog, and Chinese. Overall, UNER v1 contains nine full datasets with training, development, and test splits over eight languages, three evaluation sets for lower-resource languages (TL and CEB), and a parallel evaluation benchmark spanning six languages.
1 PAPER • 31 BENCHMARKS
The Second HAREM was an evaluation exercise in Portuguese named entity recognition. It aimed to refine text annotation processes, building on the First HAREM. Challenges included adapting guidelines for new texts and establishing a unified document with directives from both editions.
0 PAPERS • NO BENCHMARKS YET