The MEDIA French corpus is dedicated to semantic extraction from speech in a context of human/machine dialogues. The corpus has manual transcription and conceptual annotation of dialogues from 250 speakers. It is split into the following three parts : (1) the training set (720 dialogues, 12K sentences), (2) the development set (79 dialogues, 1.3K sentences, and (3) the test set (200 dialogues, 3K sentences).
6 PAPERS • NO BENCHMARKS YET
MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. We have supplemented some text fragments (visually salient nouns in this release) within the subtitles with web images, where the word sense of the fragment has been disambiguated using a cross-lingual approach. We have introduced a fill-in-the-blank task and a lexical translation task to demonstrate the utility of the dataset. Please refer to our paper for a more detailed description of the dataset and tasks. Multisubs will benefit research on visual grounding of words especially in the context of free-form sentence.
4 PAPERS • 5 BENCHMARKS