MAD (Movie Audio Descriptions) is an automatically curated large-scale dataset for natural language grounding in videos, also known as natural language moment retrieval. MAD exploits the audio descriptions available for mainstream movies. These audio descriptions are written for visually impaired audiences and are therefore highly descriptive of the visual content on screen. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and provides a unique setup for video grounding, as the visual stream is truly untrimmed: the average video duration is 110 minutes, two orders of magnitude longer than in legacy datasets.
27 PAPERS • 2 BENCHMARKS
METU-VIREF is a video referring expression dataset comprising videos from the VIRAT Ground and ILSVRC2015 VID datasets. VIRAT is a surveillance dataset containing mainly people and vehicles; to align with this and restrict the domain, only ILSVRC videos that contain vehicles are used. METU-VIREF does not include whole videos from these datasets (the videos must be downloaded from the respective sources), only referring expressions for video sequences containing an object pair. Object pairs are chosen whose relation allows a meaningful referring expression to be written.
1 PAPER • NO BENCHMARKS YET