XGLUE is an evaluation benchmark XGLUE,which is composed of 11 tasks that span 19 languages. For each task, the training data is only available in English. This means that to succeed at XGLUE, a model must have a strong zero-shot cross-lingual transfer capability to learn from the English data of a specific task and transfer what it learned to other languages. Comparing to its concurrent work XTREME, XGLUE has two characteristics: First, it includes cross-lingual NLU and cross-lingual NLG tasks at the same time; Second, besides including 5 existing cross-lingual tasks (i.e. NER, POS, MLQA, PAWS-X and XNLI), XGLUE selects 6 new tasks from Bing scenarios as well, including News Classification (NC), Query-Ad Matching (QADSM), Web Page Ranking (WPR), QA Matching (QAM), Question Generation (QG) and News Title Generation (NTG). Such diversities of languages, tasks and task origin provide a comprehensive benchmark for quantifying the quality of a pre-trained model on cross-lingual natural lan
21 PAPERS • 2 BENCHMARKS
Korean Language Understanding Evaluation (KLUE) benchmark is a series of datasets to evaluate natural language understanding capability of Korean language models. KLUE consists of 8 diverse and representative tasks, which are accessible to anyone without any restrictions. With ethical considerations in mind, we deliberately design annotation guidelines to obtain unambiguous annotations for all datasets. Furthermore, we build an evaluation system and carefully choose evaluations metrics for every task, thus establishing fair comparison across Korean language models.
19 PAPERS • 1 BENCHMARK
KorNLI is a Korean Natural Language Inference (NLI) dataset. The dataset is constructed by automatically translating the training sets of the SNLI, XNLI and MNLI datasets. To ensure translation quality, two professional translators with at least seven years of experience who specialize in academic papers/books as well as business contracts post-edited a half of the dataset each and cross-checked each other’s translation afterward. It contains 942,854 training examples translated automatically and 7,500 evaluation (development and test) examples translated manually
17 PAPERS • NO BENCHMARKS YET
KorSTS is a dataset for semantic textural similarity (STS) in Korean. The dataset is constructed by automatically the STS-B dataset. To ensure translation quality, two professional translators with at least seven years of experience who specialize in academic papers/books as well as business contracts post-edited a half of the dataset each and cross-checked each other’s translation afterward. The KorSTS dataset comprises 5,749 training examples translated automatically and 2,879 evaluation examples translated manually.
13 PAPERS • NO BENCHMARKS YET