A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets.
84 PAPERS • 2 BENCHMARKS
xSID, a new evaluation benchmark for cross-lingual (X) Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect, covering Arabic (ar), Chinese (zh), Danish (da), Dutch (nl), English (en), German (de), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Serbian (sr), Turkish (tr) and an Austro-Bavarian German dialect, South Tyrolean (de-st).
13 PAPERS • NO BENCHMARKS YET
A labelled version of the ORCAS click-based dataset of Web queries, which provides 18 million connections to 10 million distinct queries.
1 PAPER • 1 BENCHMARK
This dataset for Intent classification from human speech covers 14 coarse-grained intents from the Banking domain. This work is inspired by a similar release in the Minds-14 dataset - here, we restrict ourselves to Indian English but with a much larger training set. The data was generated by 11 (Indian English) speakers and recorded over a telephony line. We also provide access to anonymized speaker information - like gender, languages spoken, and native language - to allow more structured discussions around robustness and bias in the models you train.