Kleister NDA is a dataset for Key Information Extraction (KIE). It contains a mix of scanned and born-digital long, formal English-language documents. For this dataset, an NLP system is expected to find or infer various types of entities by using both textual and structural layout features. Kleister NDA comprises 540 non-disclosure agreements, with 3,229 unique pages and 2,160 entities to extract.
15 PAPERS • 1 BENCHMARK
The dataset contains the training and test data for the SOftware Mention Detection challenge. The data is derived from the SoMeSci Knowledge Graph of software mentions.
1 PAPER • NO BENCHMARKS YET
The Products for OCR and Information Extraction (POIE) dataset is derived from camera images of various real-world products. The images are carefully selected and manually annotated by a labeling team of 8 experienced labelers. We first crop the nutrition tables from the product images and use multiple commercial OCR engines (Azure and Baidu OCR) for pre-labeling. We then use LabelMe to manually check the location and transcription of every text box, as well as the entity values for all text in the images, and repair any OCR errors found. After discarding low-quality and blurred images, we obtain 3,000 images with 111,155 text instances.
0 PAPERS • NO BENCHMARKS YET