It consists of an extensive collection of a high quality cross-lingual fact-to-text dataset in 11 languages: Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Oriya (or), Punjabi (pa), Tamil (ta), Telugu (te), and monolingual dataset in English (en). This is the Wikipedia text <--> Wikidata KG aligned corpus used to train the data-to-text generation model. The Train & validation splits are created using distant supervision methods and Test data is generated through human annotations.
2 PAPERS • 1 BENCHMARK