5 dataset results for Text Generation AND Russian

Opusparcus

Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows.

15 PAPERS • NO BENCHMARKS YET

Taiga Corpus (An open-source corpus for machine learning.)

Taiga is a corpus in which text sources and their meta-information are collected according to popular ML tasks.

5 PAPERS • NO BENCHMARKS YET

RuCoLA

The Russian Corpus of Linguistic Acceptability (RuCoLA) is built from the ground up under the well-established binary linguistic acceptability (LA) approach. RuCoLA consists of 9.8k in-domain sentences from linguistic publications and 3.6k out-of-domain sentences produced by generative models.

4 PAPERS • 1 BENCHMARK

CLSE (Corpus of Linguistically Significant Entities)

CLSE is an augmented version of the Schema-Guided Dialog Dataset. The corpus includes 34 languages and covers 74 different semantic types to support various applications from airline ticketing to video games.

2 PAPERS • NO BENCHMARKS YET

Lenta Short Sentences

The Lenta Short Sentences dataset is a text dataset for Russian language modelling. It consists of 236K sentences sampled from the Lenta News dataset.

1 PAPER • NO BENCHMARKS YET
