🔔 Share your dataset with the ML community!

Filter by Modality

Filter by Task

Filter by Language (clear)

333 dataset results for Chinese

MaRVL (Multicultural Reasoning over Vision and Language)

Multicultural Reasoning over Vision and Language (MaRVL) is a dataset based on an ImageNet-style hierarchy representative of many languages and cultures (Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish). The selection of both concepts and images is entirely driven by native speakers. Afterwards, we elicit statements from native speakers about pairs of images. The task consists in discriminating whether each grounded statement is true or false.

19 PAPERS • 3 BENCHMARKS

RCTW-17 (Reading Chinese Text in the Wild)

Features a large-scale dataset with 12,263 annotated images. Two tasks, namely text localization and end-to-end recognition, are set up. The competition took place from January 20 to May 31, 2017. 23 valid submissions were received from 19 teams.

19 PAPERS • NO BENCHMARKS YET

CASIA-HWDB

CASIA-HWDB is a dataset for handwritten Chinese character recognition. It contains 300 files (240 in HWDB1.1 training set and 60 in HWDB1.1 test set). Each file contains about 3000 isolated gray-scale Chinese character images written by one writer, as well as their corresponding labels.

17 PAPERS • NO BENCHMARKS YET

United Nations Parallel Corpus

The first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus presented consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish.

17 PAPERS • NO BENCHMARKS YET

CH-SIMS

CH-SIMS is a Chinese single- and multimodal sentiment analysis dataset which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations. It allows researchers to study the interaction between modalities or use independent unimodal annotations for unimodal sentiment analysis.

16 PAPERS • 1 BENCHMARK

EPHOIE (phtnsantader@gmail.com)

EPHOIE is a fully-annotated dataset which is the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE consists of 1,494 images of examination paper head with complex layouts and background, including a total of 15,771 Chinese handwritten or printed text instances.

16 PAPERS • 2 BENCHMARKS

OntoNotes 4.0

OntoNotes 4.0 (OntoNotes Release 4.0)

OntoNotes Release 4.0 contains the content of earlier releases -- OntoNotes Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04 and OntoNotes Release 3.0 LDC2009T24 -- and adds newswire, broadcast news, broadcast conversation and web data in English and Chinese and newswire data in Arabic. This cumulative publication consists of 2.4 million words as follows: 300k words of Arabic newswire 250k words of Chinese newswire, 250k words of Chinese broadcast news, 150k words of Chinese broadcast conversation and 150k words of Chinese web text and 600k words of English newswire, 200k word of English broadcast news, 200k words of English broadcast conversation and 300k words of English web text.

16 PAPERS • 1 BENCHMARK

CPED (Chinese Personalized and Emotional Dialogue)

We construct a dataset named CPED from 40 Chinese TV shows. CPED consists of multisource knowledge related to empathy and personal characteristic. This knowledge covers 13 emotions, gender, Big Five personality traits, 19 dialogue acts and other knowledge.

15 PAPERS • 3 BENCHMARKS

OASST1

OASST1 (OpenAssistant Conversations Dataset)

license: apache-2.0 tags: human-feedback size_categories: 100K<n<1M pretty_name: OpenAssistant Conversations

15 PAPERS • NO BENCHMARKS YET

XFUND

XFUND (A Multilingual Form Understanding Benchmark)

XFUND is a multilingual form understanding benchmark dataset that includes human-labeled forms with key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).

15 PAPERS • NO BENCHMARKS YET

LCCC

LCCC (Large-scale Cleaned Chinese Conversation corpus)

Contains a base version (6.8million dialogues) and a large version (12.0 million dialogues).

14 PAPERS • NO BENCHMARKS YET

MuCGEC

MuCGEC (Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction)

MuCGEC is a multi-reference multi-source evaluation dataset for Chinese Grammatical Error Correction (CGEC), consisting of 7,063 sentences collected from three different Chinese-as-a-Second-Language (CSL) learner sources. Each sentence has been corrected by three annotators, and their corrections are meticulously reviewed by an expert, resulting in 2.3 references per sentence.

14 PAPERS • 1 BENCHMARK

XStoryCloze

XStoryCloze consists of the professionally translated version of the English StoryCloze dataset (Spring 2016 version) to 10 non-English languages. This dataset is intended to be used for evaluating the zero- and few-shot learning capabilities of multlingual language models. This dataset is released by Meta AI.

14 PAPERS • NO BENCHMARKS YET

Chaoyang

Chaoyang dataset contains 1111 normal, 842 serrated, 1404 adenocarcinoma, 664 adenoma, and 705 normal, 321 serrated, 840 adenocarcinoma, 273 adenoma samples for training and testing, respectively. This noisy dataset is constructed in the real scenario.

13 PAPERS • 2 BENCHMARKS

KUAKE-QIC

KUAKE-QIC (Query Intent Classification Dataset)

KUAKE Query Intent Classification, a dataset for intent classification, is used for the KUAKE-QIC task. Given the queries of search engines, the task requires to classify each of them into one of 11 medical intent categories defined in KUAKE-QIC, including diagnosis, etiology analysis, treatment plan, medical advice, test result analysis, disease description, consequence prediction, precautions, intended effects, treatment fees, and others.

13 PAPERS • 1 BENCHMARK

xSID (Cross-lingual Slot and Intent Detection)

xSID, a new evaluation benchmark for cross-lingual (X) Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect, covering Arabic (ar), Chinese (zh), Danish (da), Dutch (nl), English (en), German (de), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Serbian (sr), Turkish (tr) and an Austro-Bavarian German dialect, South Tyrolean (de-st).

13 PAPERS • NO BENCHMARKS YET

DuLeMon (Baidu Long-term Memory Conversation)

DuLeMon is a large-scale Chinese Long-term Memory Conversation dataset, which simulates long-term memory conversations and focuses on the ability to actively construct and utilize the user's and the bot's persona in a long-term interaction. DuLeMon contains about 27.5k human-human conversations, 449k utterances, and 12k persona grounding sentences. This corpus can be used to explore Long-term Memory Conversation, Personalized Dialogue, and Persona Extraction / Matching / Retrieval.

12 PAPERS • NO BENCHMARKS YET

FewCLUE

Chinese Few-shot Learning Evaluation Benchmark (FewCLUE) is a comprehensive small sample evaluation benchmark in Chinese. It includes nine tasks, ranging from single-sentence and sentence-pair classification tasks to machine reading comprehension tasks.

12 PAPERS • 5 BENCHMARKS

CMRC 2017

CMRC 2017 (Chinese Machine Reading Comprehension 2017)

Contains two different types: cloze-style reading comprehension and user query reading comprehension, associated with large-scale training data as well as human-annotated validation and hidden test set.

11 PAPERS • 3 BENCHMARKS

InDL (In-Diagram Logic)

Dataset Introduction

11 PAPERS • 1 BENCHMARK

PsyQA

PsyQA is a Chinese Dataset for generating long counseling text for mental health support.

11 PAPERS • NO BENCHMARKS YET

Synbols

Synbols is a dataset generator designed for probing the behavior of learning algorithms. By defining the distribution over latent factors one can craft a dataset specifically tailored to answer specific questions about a given algorithm.

11 PAPERS • NO BENCHMARKS YET

Astock

(1) provide financial news for each specific stock. (2) provide various stock technical factors and fundamental factors for each stock.

10 PAPERS • 2 BENCHMARKS

CWRU Bearing Dataset

Data was collected for normal bearings, single-point drive end and fan end defects. Data was collected at 12,000 samples/second and at 48,000 samples/second for drive end bearing experiments. All fan end bearing data was collected at 12,000 samples/second.

10 PAPERS • 1 BENCHMARK

FM-IQA (Freestyle Multilingual Image Question Answering)

FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.

10 PAPERS • NO BENCHMARKS YET

HRSC2016 (High resolution ship collections 2016)

High-resolution ship collections 2016 (HRSC2016) is a data set used for scientific research. Currently, all of the images in HRSC2016 were collected from Google Earth.

10 PAPERS • NO BENCHMARKS YET

JEC-QA

JEC-QA is a LQA (Legal Question Answering) dataset collected from the National Judicial Examination of China. It contains 26,365 multiple-choice and multiple-answer questions in total. The task of the dataset is to predict the answer using the questions and relevant articles. To do well on JEC-QA, both retrieving and answering are important.

10 PAPERS • NO BENCHMARKS YET

PersonalDialog

PersonalDialog is a large-scale multi-turn dialogue dataset containing various traits from a large number of speakers. The dataset consists of 20.83M sessions and 56.25M utterances from 8.47M speakers. Each utterance is associated with a speaker who is marked with traits like Age, Gender, Location, Interest Tags, etc. Several anonymization schemes are designed to protect the privacy of each speaker.

10 PAPERS • NO BENCHMARKS YET

RS-Haze

A large-scale non-homogeneous remote sensing image dehazing dataset

10 PAPERS • 1 BENCHMARK

SARDet-100K

The SARDet-100K dataset encompasses a total of 116,598 images, and 245,653 instances distributed across six categories: Aircraft, Ship, Car, Bridge, Tank, and Harbor. SARDet100K dataset stands as the first large-scale SAR object detection dataset, comparable in size to the widely used COCO dataset (118K images). The scale and diversity of the SARDet-100K dataset provide researchers with robust training and evaluation for advancing SAR object detection algorithms and techniques, fostering the development of SOTA models in this domain.

10 PAPERS • 1 BENCHMARK

BSTC (Baidu Speech Translation Corpus)

BSTC (Baidu Speech Translation Corpus) is a large-scale dataset for automatic simultaneous interpretation. BSTC version 1.0 contains 50 hours of real speeches, including three parts, the audio files, the transcripts, and the translations. The corpus can be used to build automatic simultaneous interpretation system. The corpus is collected from the Chinese mandarin talks and reports, including science, technology, culture, economy, etc.,. The utterances in talks and reports are carefully transcribed into Chinese text, and further translated into English text. The sentence boundary is determined by the English text instead of the Chinese text which is analogous to the previous related corpus (TED and Translation Augmented LibriSpeech Corpus).

9 PAPERS • NO BENCHMARKS YET

CJRC

CJRC (Chinese judicial reading comprehension)

The Chinese judicial reading comprehension (CJRC) dataset contains approximately 10K documents and almost 50K questions with answers. The documents come from judgment documents and the questions are annotated by law experts.

9 PAPERS • 2 BENCHMARKS

CMeEE

CMeEE (Chinese Medical Named Entity Recognition Dataset)

Chinese Medical Named Entity Recognition, a dataset first released in CHIP20204, is used for CMeEE task. Given a pre-defined schema, the task is to identify and extract entities from the given sentence and classify them into nine categories: disease, clinical manifestations, drugs, medical equipment, medical procedures, body, medical examinations, microorganisms, and department.

9 PAPERS • 2 BENCHMARKS

CODEBRIM

CODEBRIM (COncrete DEfect BRidge IMage Dataset)

Dataset for multi-target classification of five commonly appearing concrete defects.

9 PAPERS • NO BENCHMARKS YET

MMCU

MMCU (Measuring Massive Multitask Chinese Understanding)

We propose a test to measure the multitask accuracy of large Chinese language models. We constructed a large-scale, multi-task test consisting of single and multiple-choice questions from various branches of knowledge. The test encompasses the fields of medicine, law, psychology, and education, with medicine divided into 15 sub-tasks and education into 8 sub-tasks. The questions in the dataset were manually collected by professionals from freely available online resources, including university medical examinations, national unified legal professional qualification examinations, psychological counselor exams, graduate entrance examinations for psychology majors, and the Chinese National College Entrance Examination. In total, we collected 11,900 questions, which we divided into a few-shot development set and a test set. The few-shot development set contains 5 questions per topic, amounting to 55 questions in total. The test set comprises 11,845 questions.

9 PAPERS • NO BENCHMARKS YET

MagicData-RAMC

The MagicData-RAMC corpus contains 180 hours of conversational speech data recorded from native speakers of Mandarin Chinese over mobile phones with a sampling rate of 16 kHz. The dialogs in the dialogs are classified into 15 diversified domains and tagged with topic labels, ranging from science and technology to ordinary life. Accurate transcription and precise speaker voice activity timestamps are manually labeled for each sample. Speakers' detailed information is also provided.

9 PAPERS • NO BENCHMARKS YET

OpenLane-V2 val

OpenLane-V2 is the world's first perception and reasoning benchmark for scene structure in autonomous driving. The primary task of the dataset is scene structure perception and reasoning, which requires the model to recognize the dynamic drivable states of lanes in the surrounding environment. The challenge of this dataset includes not only detecting lane centerlines and traffic elements but also recognizing the attribute of traffic elements and topology relationships on detected objects.

9 PAPERS • 1 BENCHMARK

Wukong

Wukong is a large-scale Chinese cross-modal dataset for benchmarking different multi-modal pre-training methods to facilitate the Vision-Language Pre-training (VLP). This dataset contains 100 million Chinese image-text pairs from the web. This base query list is taken from and is filtered according to the frequency of Chinese words and phrases.

9 PAPERS • NO BENCHMARKS YET

BEAT2 (BEAT-SMPLX-FLAME)

We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines MoShed SMPLX body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enh

8 PAPERS • 2 BENCHMARKS

BiToD

BiToD is a bilingual multi-domain dataset for end-to-end task-oriented dialogue modeling. BiToD contains over 7k multi-domain dialogues (144k utterances) with a large and realistic bilingual knowledge base. It serves as an effective benchmark for evaluating bilingual ToD systems and cross-lingual transfer learning approaches.

8 PAPERS • NO BENCHMARKS YET

CHIP-STS

CHIP-STS (Semantic Textual Similarity Dataset)

CHIP Semantic Textual Similarity, a dataset for sentence similarity in the non-i.i.d. (non-independent and identically distributed) setting, is used for the CHIP-STS task. Specifically, the task aims to transfer learning between disease types on Chinese disease questions and answer data. Given question pairs related to 5 different diseases (The disease types in the training and testing set are different), the task intends to determine whether the semantics of the two sentences are similar.

8 PAPERS • 1 BENCHMARK

CMRC 2019

CMRC 2019 (Chinese Machine Reading Comprehension 2019)

CMRC 2019 is a Chinese Machine Reading Comprehension dataset that was used in The Third Evaluation Workshop on Chinese Machine Reading Comprehension. Specifically, CMRC 2019 is a sentence cloze-style machine reading comprehension dataset that aims to evaluate the sentence-level inference ability.

8 PAPERS • 3 BENCHMARKS

E-KAR (Benchmark for Explainable Knowledge-intensive Analogical Reasoning)

The ability to recognize analogies is fundamental to human cognition. Existing benchmarks to test word analogy do not reveal the underneath process of analogical reasoning of neural models.

8 PAPERS • NO BENCHMARKS YET

RRS

RRS (Restoration-200k for Response Selection)

| | Train | Validation | Test | Ranking Test | | --------- | ----- | ---------- | ------- | ------------ | | size | 0.4M | 50K | 5K | 800 | | pos:neg | 1:1 | 1:9 | 1.2:8.8 | - | | avg turns | 5.0 | 5.0 | 5.0 | 5.0 |

8 PAPERS • 1 BENCHMARK

WMT 2018 News (WMT 2018 News Translation Task)

News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Chinese, Czech, Estonian, German, Finnish, Russian, Turkish) and additional 1500 sentences from each of the 7 languages translated to English. The sentences were selected from dozens of news websites and translated by professional translators.

8 PAPERS • NO BENCHMARKS YET

Weather2K

A multivariate spatio-temporal benchmark dataset for meteorological forecasting based on real-time observation data from ground weather stations.

8 PAPERS • 16 BENCHMARKS

Weibo21

Weibo21 is a benchmark of fake news dataset for multi-domain fake news detection (MFND) with domain label annotated, which consists of 4,488 fake news and 4,640 real news from 9 different domains.

8 PAPERS • NO BENCHMARKS YET

2018 n2c2 (Track 2) - Adverse Drug Events and Medication Extraction

2018 n2c2 (Track 2) - Adverse Drug Events and Medication Extraction (2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records)

Abstract Objective This article summarizes the preparation, organization, evaluation, and results of Track 2 of the 2018 National NLP Clinical Challenges shared task. Track 2 focused on extraction of adverse drug events (ADEs) from clinical records and evaluated 3 tasks: concept extraction, relation classification, and end-to-end systems. We perform an analysis of the results to identify the state of the art in these tasks, learn from it, and build on it.

7 PAPERS • NO BENCHMARKS YET

Datasets

333 dataset results for Chinese