4 dataset results for Key Information Extraction AND English

Kleister NDA is a dataset for Key Information Extraction (KIE). The dataset contains a mix of scanned and born-digital long formal English-language documents. For this dataset, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister NDA dataset has 540 non-disclosure agreements, with 3,229 unique pages and 2,160 entities to extract.

15 PAPERS • 1 BENCHMARK

Information Extraction from Tables

Information Extraction from Tables (Extraction materials compositions from tables of materials science research papers)

2 PAPERS • NO BENCHMARKS YET

SOMD

SOMD (SOftware Mention Detection)

The dataset contains the training and test data for the SOftware Mention Detection challenge. The data is derived from the SoMeSci Knowledge Graph of software mentions.

1 PAPER • NO BENCHMARKS YET

POIE (Products for OCR and Information Extraction)

The Products for OCR and Information Extraction (POIE) dataset is derived from camera images of various real-world products. The images are carefully selected and manually annotated. Our labeling team consists of 8 experienced labelers. We first crop the nutrition tables from product images and apply multiple commercial OCR engines (Azure and Baidu OCR) for pre-labeling. We then use LabelMe to manually check the location and transcription of every text box, as well as the entity values for all the text in the images, and repair any OCR errors found. After discarding low-quality and blurred images, we obtain 3,000 images with 111,155 text instances.

0 PAPER • NO BENCHMARKS YET
