2 dataset results for Texts AND Aragonese

WikiMatrix

WikiMatrix is a dataset of parallel sentences mined from the textual content of Wikipedia for all possible language pairs. The mined data consists of:

87 PAPERS • NO BENCHMARKS YET

OSCAR

OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a large multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The version of the dataset used to train multilingual models such as BART contains 138 GB of text.

56 PAPERS • NO BENCHMARKS YET
