Acappella comprises around 46 hours of a cappella solo singing videos sourced from YouTbe, sampled across different singers and languages. Four languages are considered: English, Spanish, Hindi and others.
5 PAPERS • NO BENCHMARKS YET
The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which was processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English language Lip Reading in the Wild (LRW) dataset, with each H264-compressed MPEG-4 video encoding one word of interest in a context of 1.16 seconds duration, which yields compatibility for studying transfer learning between both datasets. Choosing video material based on naturally spoken language in a natural environment ensures more robust results for real-world applications than artificially generated datasets with as little noise as possible. The 500 different spoken words ranging between 4-18 characters in length each have 500 instances and separate MPEG-4 audio- and text metadata-files, originating from 1018 parliamentary sessions. Additionally, the complete TextGrid files containing the segmentation information of those sessions are also
Audiovisual Moments in Time (AVMIT) is a large-scale dataset of audiovisual action events. The dataset includes the annotation of 57,177 audiovisual videos from the Moments in Time dataset, each independently evaluated by 3 of 11 trained participants. Each annotation pertains to whether the labelled audiovisual action event is present and whether it is the most prominent feature of the video. The dataset also provides a curated test set of 960 videos across 16 classes, suitable for comparative experiments involving computational models and human participants, specifically when addressing research questions where audiovisual correspondence is of critical importance.
1 PAPER • NO BENCHMARKS YET