6,981 SAT-level geometry problems with complete natural-language descriptions, geometric shapes, formal-language annotations, and theorem-sequence annotations.
2 PAPERS • NO BENCHMARKS YET
The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. The AQL is the first publicly available query log that combines size, scope, and diversity, enabling research on new retrieval models and search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.
1 PAPER • NO BENCHMARKS YET
A Rich Annotated Mandarin Conversational (RAMC) speech dataset including 180 hours of Mandarin Chinese dialogue, split into 150, 10, and 20 hours for the training, development, and test sets respectively. It contains 351 multi-turn dialogues, each of which is a coherent and compact conversation centered on one theme.
This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish.
The Inpainting dataset consists of synchronized labeled images and LiDAR-scanned point clouds, captured with a HESAI Pandora All-in-One Sensing Kit. It was collected under various lighting conditions and traffic densities in Beijing, China.
1 PAPER • 1 BENCHMARK
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
Baidu PersonaChat is a personalization dataset collected and open-sourced by Baidu. It is similar to ConvAI2, but in Chinese.
CC-Riddle is a Chinese character riddle dataset covering the majority of common simplified Chinese characters, built by crawling riddles from the Web and generating brand-new ones. In the generation stage, the authors provide the Chinese phonetic alphabet, decomposition, and explanation of the solution character to the generation model and obtain multiple riddle descriptions for each character. The generated riddles are then manually filtered, so the final CC-Riddle dataset comprises both human-written riddles and filtered generated riddles.
Chinese Character Stroke Extraction (CCSE) is a benchmark containing two large-scale datasets: Kaiti CCSE (CCSE-Kai) and Handwritten CCSE (CCSE-HW). It is designed for stroke extraction problems.
CHIP Clinical Trial Classification (CHIP-CTC) is a dataset for classifying clinical trial eligibility criteria, the fundamental guidelines that determine whether a subject qualifies for a clinical trial. All text data are collected from the website of the Chinese Clinical Trial Registry (ChiCTR), and a total of 44 categories are defined. The task is essentially text classification; although it is not a new task, studies and corpora for Chinese clinical trial criteria are still limited, and the dataset aims to promote future research for social benefit.
CHORD is the first chorus recognition dataset containing 627 songs for public use.
The CLPD dataset comprises 1200 images that encompass various regions within mainland China. These images were sourced from diverse origins, including the internet, mobile devices, and in-car recording devices. While the majority of the images were recorded during daylight hours, a portion of them were captured at nighttime. The dataset predominantly features passenger cars, with a limited number of images depicting trucks and buses.
Chinese Medical Information Extraction (CMeIE), a dataset also released in CHIP 2020, is used for the CMeIE task. The task aims at identifying both entities and relations in a sentence following schema constraints. There are 53 relations defined in the dataset, including 10 synonymous sub-relationships and 43 other sub-relationships.
CNFOOD-241 contains 241 Chinese dishes with 191,811 images: 170,843 in the training set and 20,943 in the validation set. All images are resized to 600x600. Because some images come from ChineseFoodNet, commercial use is not permitted. CNFOOD-241-Chen is the CNFOOD-241 dataset split with the list introduced in the paper "Res-VMamba: Fine-Grained Food Category Visual Classification Using Selective State Space Models with Deep Residual Learning," which randomly divides the data into train, val, and test parts.
CPsyCounD is a high-quality multi-turn dialogue dataset with a total of 3,134 multi-turn consultation dialogues. It covers nine representative topics and seven classic schools of psychological counseling.
A general multi-turn dialogue evaluation dataset with nine topics. Each topic has five representative cases, yielding a comprehensive evaluation set of 45 cases.
Chinese Spelling Correction Dataset for errors generated by pinyin IME (CSCD-IME), a dataset containing 40,000 annotated sentences from real posts of official media on Sina Weibo. It is designed to detect and correct spelling mistakes in Chinese texts.
CSL is a large-scale Chinese Scientific Literature dataset containing the titles, abstracts, keywords, and academic fields of 396,209 papers. According to its authors, CSL is the first scientific document dataset in Chinese.
The Chinese Stock Policy Retrieval Dataset (CSPRD) contains a Chinese policy corpus of 10,002 articles and 709 prospectus examples from 545 companies listed on China's Science and Technology Innovation Board (STAR Market). CSPRD is bilingual in Chinese and English (English translated by ChatGPT) and is annotated by experienced experts from the Shanghai Stock Exchange.
ChCatExt is composed of BidAnn (bid announcements), FinAnn (financial announcements), and CreRat (credit rating reports). It is designed for reconstructing catalog trees from documents.
ChinaOpen is a new video dataset targeted at open-world multimodal learning, with raw data gathered from Bilibili, a popular Chinese video-sharing website. The dataset has a large webly annotated training set of videos (associated with user-generated titles and tags) and a smaller manually annotated test set of videos (with manually checked user titles and tags, manually written captions, and manual labels describing the visual objects, actions, and scenes shown in the visual content).
A large-scale Chinese legal dataset for judgment prediction. The dataset contains more than 2.6 million criminal cases published by the Supreme People's Court of China, several times more than other datasets used in existing work on judgment prediction.
Chinese Literature NER RE is a Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text. It is constructed from hundreds of Chinese literature articles.
The Chinese Traditional Painting dataset for style transfer contains 1,000 content images and 100 style images. The content images are mostly photorealistic scenes of mountains, lakes, rivers, bridges, and buildings in the regions south of the Yangtze River; they include not only scenes of China but also pictures of the Rhine, the Alps, Yellowstone, the Grand Canyon, etc. The style images include diverse types of Chinese traditional paintings.
Conic10K is an open-ended math problem dataset on conic sections from Chinese senior high school education. It contains 10,861 carefully annotated problems, each with a formal representation, the corresponding text spans, the answer, and natural-language rationales. The questions require long reasoning steps while the topic is limited to conic sections. The dataset can be used to evaluate models on two tasks: semantic parsing and mathematical question answering (mathQA).
ConvSumX is a cross-lingual conversation summarization benchmark, through a new annotation schema that explicitly considers source input context. ConvSumX consists of 2 sub-tasks under different real-world scenarios, with each covering 3 language directions.
The Digitally Generated Numerals (DIGITal) dataset consists of 100,000 image pairs representing digits from 0 to 9. Each pair includes a low-quality and a high-quality version, with a resolution of 128x128 pixels.
This dataset accompanies the paper, under review at Mis2-KDD 2021: Dataset of Propaganda Techniques of the State-Sponsored Information Operation of the People's Republic of China.
DiaKG is a high-quality Chinese dataset for building a diabetes knowledge graph.
The EUCA dataset is associated with the paper "EUCA: the End-User-Centered Explainable AI Framework".
A fine-grained corpus for detecting, identifying, and correcting Chinese grammatical errors, collected mainly from multiple-choice questions in public school Chinese examinations, with multiple references. An online evaluation site for the test set is available at https://codalab.lisn.upsaclay.fr/competitions/8020
Previous research on adapting a general neural machine translation (NMT) model to a specific domain usually neglects the diversity of translation within the same domain, a core problem for domain adaptation in real-world scenarios. One representative of such challenging scenarios is deploying a translation system for a conference with a specific topic, e.g., global warming or coronavirus, where resources are usually extremely limited due to the tight schedule. To motivate wider investigation of such scenarios, FGraDA presents a real-world fine-grained domain adaptation task in machine translation. The FGraDA dataset consists of Chinese-English translation tasks for four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones. Each sub-domain is equipped with a development set and a test set for evaluation purposes. To be closer to reality, FGraDA does not employ any in-domain bilingual training data but provides
The "Crime Facts" of "Offenses of Fraudulence" in Judicial Yuan Verdicts Dataset
HT Docking is a dataset consisting of 200 million 3D complex structures and 2D structure scores across a consistent set of 13 million "in-stock" molecules over 15 receptors, or binding sites, across the SARS-CoV-2 proteome. It is used to study surrogate model accuracy for protein-ligand docking.
Hypertension Disease Medication dataset.
IEE is a financial-domain dataset for the insurance entity extraction task. Its goal is to locate named entities mentioned in the input sentence.
JDDC 2.0 is a large-scale multimodal multi-turn dialogue dataset collected from a mainstream Chinese E-commerce platform JD.com, containing about 246 thousand dialogue sessions, 3 million utterances, and 507 thousand images, along with product knowledge bases and image category annotations. The dataset is divided into the training set, the validation set, and the test set according to the ratio of 80%, 10%, and 10%.
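The 80%/10%/10% partitioning described above can be sketched as a simple shuffled split; the `split_dataset` helper below is a hypothetical illustration, not the dataset's official tooling:

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and partition items into train/val/test by the given ratios."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # → 800 100 100
```

In practice the split would be applied at the level of dialogue sessions rather than individual utterances, so that no session is divided across subsets.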
Word embeddings are a modern distributed word representation approach widely used in many natural language processing tasks. Converting the vocabulary of a legal document into a word embedding model makes legal documents amenable to machine learning, deep learning, and other algorithms, and thereby to downstream natural language processing tasks such as document classification, contract review, and machine translation. The most common and practical way to evaluate the accuracy of a word embedding model is to use a benchmark set built from linguistic rules or word relationships and perform analogy reasoning via algebraic calculation. The paper establishes a 1,134-question Legal Analogical Reasoning Questions Set (LARQS) from a 2,388-document Chinese Codex corpus using five kinds of legal relations, which is then used to evaluate the accuracy of Chinese word embedding models. Moreover, the authors found that legal relations might be ubiquitous
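The analogy reasoning via algebraic calculation mentioned above (solving "a is to b as c is to ?" as vec(b) - vec(a) + vec(c)) can be sketched as follows; the toy vectors and vocabulary are illustrative assumptions, not taken from LARQS or any real embedding model:

```python
import numpy as np

# Toy embedding table; a real evaluation would load a trained
# Chinese legal-domain word embedding model instead.
emb = {
    "landlord": np.array([1.0, 0.0, 0.2]),
    "lease":    np.array([0.9, 0.1, 0.8]),
    "employer": np.array([0.0, 1.0, 0.2]),
    "contract": np.array([-0.1, 0.9, 0.8]),
}

def solve_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c),
    returning the nearest vocabulary word (excluding a, b, c) by cosine similarity."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(solve_analogy("landlord", "lease", "employer", emb))  # → contract
```

A benchmark like LARQS would score the model by the fraction of analogy questions for which the nearest neighbor matches the expected answer.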
Large Scale Informal Chinese Corpus (LSICC) is a large-scale corpus of informal Chinese. It contains around 37 million book reviews and 50 thousand netizens' comments on news articles.
An open-source online generative dictionary that takes a word and context containing the word as input and automatically generates a definition as output. Incorporating state-of-the-art definition generation models, it supports not only Chinese and English, but also Chinese-English cross-lingual queries. Moreover, it has a user-friendly front-end design that can help users understand the query words quickly and easily.
The Live Comment Dataset is a large-scale dataset with 2,361 videos and 895,929 live comments that were written while the videos were streamed.
A 160B bilingual long-text dataset with 3 categories: holistic, aggregated and chaotic long texts.
MGSM8KInstruct is a multilingual math reasoning instruction dataset encompassing ten distinct languages, addressing the scarcity of training data for multilingual math reasoning.
MTC is a financial-domain dataset for the multi-label topic classification task. It aims to identify the topics of spoken dialogues.
MULTI-Benchmark is a cutting-edge benchmark for evaluating multimodal large language models (MLLMs). It is designed to test the understanding of complex tables and images, and reasoning over long contexts.