Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows.
15 PAPERS • NO BENCHMARKS YET
Taiga is a corpus in which text sources and their meta-information are collected according to popular ML tasks.
5 PAPERS • NO BENCHMARKS YET
The Russian Corpus of Linguistic Acceptability (RuCoLA) is built from the ground up under the well-established binary linguistic acceptability (LA) approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models.
4 PAPERS • 1 BENCHMARK
CLSE is an augmented version of the Schema-Guided Dialog Dataset. The corpus includes 34 languages and covers 74 different semantic types to support various applications from airline ticketing to video games.
2 PAPERS • NO BENCHMARKS YET
The Lenta Short Sentences dataset is a text dataset for Russian language modelling. It consists of 236K sentences sampled from the Lenta News dataset.
1 PAPER • NO BENCHMARKS YET