6,981 SAT-level geometry problems with complete natural-language descriptions, geometric shapes, formal-language annotations, and theorem-sequence annotations.
2 PAPERS • NO BENCHMARKS YET
The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. The AQL is the first publicly available query log that combines size, scope, and diversity, enabling research on new retrieval models and search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry.
1 PAPER • NO BENCHMARKS YET
A Rich Annotated Mandarin Conversational (RAMC) speech dataset including 180 hours of Mandarin Chinese dialogue, split into 150, 10, and 20 hours for the training, development, and test sets respectively. It contains 351 multi-turn dialogues, each of which is a coherent and compact conversation centered on one theme.
This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish.
The Inpainting dataset consists of synchronized labeled images and LiDAR-scanned point clouds, captured with a HESAI Pandora All-in-One Sensing Kit. It was collected under various lighting conditions and traffic densities in Beijing, China.
1 PAPER • 1 BENCHMARK
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
Baidu PersonaChat is a personalization dataset collected and open-sourced by Baidu. It is similar to ConvAI2, but in Chinese.
CC-Riddle is a Chinese character riddle dataset covering the majority of common simplified Chinese characters, built by crawling riddles from the Web and generating brand-new ones. In the generation stage, the authors provide the Chinese phonetic alphabet, decomposition, and explanation of the solution character to the generation model and obtain multiple riddle descriptions for each character. The generated riddles are then manually filtered, so the final CC-Riddle dataset comprises both human-written riddles and filtered generated riddles.
Chinese Character Stroke Extraction (CCSE) is a benchmark containing two large-scale datasets: Kaiti CCSE (CCSE-Kai) and Handwritten CCSE (CCSE-HW). It is designed for stroke extraction problems.
CHIP Clinical Trial Classification (CHIP-CTC) is a dataset for classifying clinical trial eligibility criteria, the fundamental guidelines that determine whether a subject qualifies for a clinical trial. All text data are collected from the website of the Chinese Clinical Trial Registry (ChiCTR), and a total of 44 categories are defined. The task is essentially text classification; although it is not a new task, studies and corpora for Chinese clinical trial criteria are still limited, and the dataset aims to promote future research for social benefit.
CHORD is the first chorus recognition dataset containing 627 songs for public use.
The CLPD dataset comprises 1200 images that encompass various regions within mainland China. These images were sourced from diverse origins, including the internet, mobile devices, and in-car recording devices. While the majority of the images were recorded during daylight hours, a portion of them were captured at nighttime. The dataset predominantly features passenger cars, with a limited number of images depicting trucks and buses.
Chinese Medical Information Extraction (CMeIE), a dataset also released in CHIP 2020, is used for the CMeIE task. The task aims at identifying both entities and relations in a sentence following schema constraints. There are 53 relations defined in the dataset, including 10 synonymous sub-relationships and 43 other sub-relationships.
CNFOOD-241 contains 241 Chinese dishes with 191,811 images: 170,843 in the training set and 20,943 in the validation set. All images are resized to 600x600. Because some images come from ChineseFoodNet, commercial use is not permitted. CNFOOD-241-Chen is the CNFOOD-241 dataset split with the list introduced in the paper "Res-VMamba: Fine-Grained Food Category Visual Classification Using Selective State Space Models with Deep Residual Learning," which randomly divides the data into train, val, and test parts.
CPsyCounD is a high-quality multi-turn dialogue dataset with a total of 3,134 multi-turn consultation dialogues. It covers nine representative topics and seven classic schools of psychological counseling.
A general multi-turn dialogue evaluation dataset with nine topics. Each topic has five representative cases, yielding a comprehensive evaluation set of 45 cases.
Chinese Spelling Correction Dataset for errors generated by pinyin IME (CSCD-IME), a dataset containing 40,000 annotated sentences from real posts of official media on Sina Weibo. It is designed to detect and correct spelling mistakes in Chinese texts.
CSL is a large-scale Chinese Scientific Literature dataset containing the titles, abstracts, keywords, and academic fields of 396,209 papers. According to its authors, CSL is the first scientific document dataset in Chinese.
The Chinese Stock Policy Retrieval Dataset (CSPRD) contains a Chinese policy corpus of 10,002 articles and 709 prospectus examples from 545 companies listed on China's Science and Technology Innovation Board (STAR Market). CSPRD is bilingual in Chinese and English (English translated by ChatGPT) and is annotated by experienced experts from the Shanghai Stock Exchange.
ChCatExt is composed of BidAnn (bid announcements), FinAnn (financial announcements), and CreRat (credit rating reports). It is designed for reconstructing catalog trees from documents.
ChinaOpen is a new video dataset targeted at open-world multimodal learning, with raw data gathered from Bilibili, a popular Chinese video-sharing website. The dataset has a large webly annotated training set of videos (associated with user-generated titles and tags) and a smaller manually annotated test set of videos (with manually checked user titles and tags, manually written captions, and manual labels describing the visual objects, actions, and scenes shown in the visual content).
A large-scale Chinese legal dataset for judgment prediction. The dataset contains more than 2.6 million criminal cases published by the Supreme People's Court of China, several times more than other datasets used in existing work on judgment prediction.
Chinese Literature NER RE is a Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text. It is constructed from hundreds of Chinese literature articles.
The Chinese Traditional Painting dataset for style transfer contains 1,000 content images and 100 style images. The content images are mostly photorealistic scenes of mountains, lakes, rivers, bridges, and buildings in the regions south of the Yangtze River; they include not only scenes of China but also pictures of the Rhine, the Alps, Yellowstone, the Grand Canyon, etc. The style images include diverse types of Chinese traditional paintings.
Conic10K is an open-ended math problem dataset on conic sections from Chinese senior high school education. It contains 10,861 carefully annotated problems, each with a formal representation, the corresponding text spans, the answer, and natural-language rationales. The questions require long reasoning steps while the topic is limited to conic sections. The dataset can be used to evaluate models on two tasks: semantic parsing and mathematical question answering (mathQA).
ConvSumX is a cross-lingual conversation summarization benchmark, through a new annotation schema that explicitly considers source input context. ConvSumX consists of 2 sub-tasks under different real-world scenarios, with each covering 3 language directions.
The Digitally Generated Numerals (DIGITal) dataset consists of 100,000 image pairs representing digits from 0 to 9. Each pair includes a low-quality and a high-quality version, with a resolution of 128x128 pixels.
This dataset accompanies the paper, under review at Mis2-KDD 2021: Dataset of Propaganda Techniques of the State-Sponsored Information Operation of the People's Republic of China.
DiaKG is a high-quality Chinese dataset for building a diabetes knowledge graph.
The EUCA dataset is associated with the paper "EUCA: the End-User-Centered Explainable AI Framework".
A fine-grained corpus for detecting, identifying, and correcting Chinese grammatical errors, collected mainly from multiple-choice questions in public school Chinese examinations, with multiple references. An online evaluation site for the test set is available at https://codalab.lisn.upsaclay.fr/competitions/8020
Previous research on adapting a general neural machine translation (NMT) model to a specific domain usually neglects the diversity of translation within the same domain, a core problem for domain adaptation in real-world scenarios. One representative of such challenging scenarios is deploying a translation system for a conference with a specific topic, e.g., global warming or coronavirus, where resources are usually extremely limited due to the tight schedule. To motivate wider investigation of such scenarios, FGraDA presents a real-world fine-grained domain adaptation task in machine translation. The FGraDA dataset consists of Chinese-English translation tasks for four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones. Each sub-domain is equipped with a development set and a test set for evaluation purposes. To be closer to reality, FGraDA does not employ any in-domain bilingual training data but provides
The "Crime Facts" of "Offenses of Fraudulence" in Judicial Yuan Verdicts Dataset
HT Docking is a dataset consisting of 200 million 3D complex structures and 2D structure scores across a consistent set of 13 million "in-stock" molecules over 15 receptors, or binding sites, across the SARS-CoV-2 proteome. It is used to study surrogate model accuracy for protein-ligand docking.
Hypertension Disease Medication dataset.
IEE is a financial-domain dataset for the insurance entity extraction task. Its goal is to locate named entities mentioned in the input sentence.
JDDC 2.0 is a large-scale multimodal multi-turn dialogue dataset collected from a mainstream Chinese E-commerce platform JD.com, containing about 246 thousand dialogue sessions, 3 million utterances, and 507 thousand images, along with product knowledge bases and image category annotations. The dataset is divided into the training set, the validation set, and the test set according to the ratio of 80%, 10%, and 10%.
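The 80%/10%/10% partitioning described above can be sketched as a simple shuffled split; the `split_dataset` helper below is a hypothetical illustration, not the dataset's official tooling:

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and partition items into train/val/test by the given ratios."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # → 800 100 100
```

In practice the split would be applied at the level of dialogue sessions rather than individual utterances, so that no session is divided across subsets.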
Word embeddings are a modern distributed word representation approach widely used in many natural language processing tasks. Converting the vocabulary of a legal document into a word embedding model makes legal documents amenable to machine learning, deep learning, and other algorithms, and thereby to downstream natural language processing tasks such as document classification, contract review, and machine translation. The most common and practical way to evaluate the accuracy of a word embedding model is to use a benchmark set built from linguistic rules or word relationships and perform analogy reasoning via algebraic calculation. The paper establishes a 1,134-question Legal Analogical Reasoning Questions Set (LARQS) from a 2,388-document Chinese Codex corpus using five kinds of legal relations, which is then used to evaluate the accuracy of Chinese word embedding models. Moreover, the authors found that legal relations might be ubiquitous
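The analogy reasoning via algebraic calculation mentioned above (solving "a is to b as c is to ?" as vec(b) - vec(a) + vec(c)) can be sketched as follows; the toy vectors and vocabulary are illustrative assumptions, not taken from LARQS or any real embedding model:

```python
import numpy as np

# Toy embedding table; a real evaluation would load a trained
# Chinese legal-domain word embedding model instead.
emb = {
    "landlord": np.array([1.0, 0.0, 0.2]),
    "lease":    np.array([0.9, 0.1, 0.8]),
    "employer": np.array([0.0, 1.0, 0.2]),
    "contract": np.array([-0.1, 0.9, 0.8]),
}

def solve_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c),
    returning the nearest vocabulary word (excluding a, b, c) by cosine similarity."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(solve_analogy("landlord", "lease", "employer", emb))  # → contract
```

A benchmark like LARQS would score the model by the fraction of analogy questions for which the nearest neighbor matches the expected answer.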
Large Scale Informal Chinese Corpus (LSICC) is a large-scale corpus of informal Chinese. It contains around 37 million book reviews and 50 thousand netizens' comments on news articles.
An open-source online generative dictionary that takes a word and context containing the word as input and automatically generates a definition as output. Incorporating state-of-the-art definition generation models, it supports not only Chinese and English, but also Chinese-English cross-lingual queries. Moreover, it has a user-friendly front-end design that can help users understand the query words quickly and easily.
The Live Comment Dataset is a large-scale dataset with 2,361 videos and 895,929 live comments that were written while the videos were streamed.
A 160B bilingual long-text dataset with 3 categories: holistic, aggregated and chaotic long texts.
MGSM8KInstruct is a multilingual math reasoning instruction dataset encompassing ten distinct languages, addressing the scarcity of training data for multilingual math reasoning.
MTC is a financial-domain dataset for the multi-label topic classification task. It aims to identify the topics of spoken dialogues.
MULTI-Benchmark is a cutting-edge benchmark for evaluating multimodal large language models (MLLMs). It is designed to test the understanding of complex tables and images, and reasoning over long contexts.