Aesthetic Visual Analysis is a dataset for aesthetic image assessment that contains over 250,000 images along with a rich variety of meta-data including a large number of aesthetic scores for each image, semantic labels for over 60 categories as well as labels related to photographic style.
11 PAPERS • 3 BENCHMARKS
Given 10 minimally contrastive (highly similar) images and a complex description for one of them, the task is to retrieve the correct image. The source of most images are videos and descriptions as well as retrievals come from human.
9 PAPERS • 1 BENCHMARK
DeepSportradar is a benchmark suite of computer vision tasks, datasets and benchmarks for automated sport understanding. DeepSportradar currently supports four challenging tasks related to basketball: ball 3D localization, camera calibration, player instance segmentation and player re-identification. For each of the four tasks, a detailed description of the dataset, objective, performance metrics, and the proposed baseline method are provided.
3 PAPERS • NO BENCHMARKS YET
Largest, first-of-its-kind, in-the-wild, fine-grained workout/exercise posture analysis dataset, covering three different exercises: BackSquat, Barbell Row, and Overhead Press. Seven different types of exercise errors are covered. Unlabeled data is also provided to facilitate self-supervised learning.
2 PAPERS • NO BENCHMARKS YET
VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.
We provide a database containing shot scale annotations (i.e., the apparent distance of the camera from the subject of a filmed scene) for more than 792,000 image frames. Frames belong to 124 full movies from the entire filmographies by 6 important directors: Martin Scorsese, Jean-Luc Godard, Béla Tarr, Federico Fellini, Michelangelo Antonioni, and Ingmar Bergman. Each frame, extracted from videos at 1 frame per second, is annotated on the following scale categories: Extreme Close Up (ECU), Close Up (CU), Medium Close Up (MCU), Medium Shot (MS), Medium Long Shot (MLS), Long Shot (LS), Extreme Long Shot (ELS), Foreground Shot (FS), and Insert Shots (IS). Two independent coders annotated all frames from the 124 movies, whilst a third one checked their coding and made decisions in cases of disagreement. The CineScale database enables AI-driven interpretation of shot scale data and opens to a large set of research activities related to the automatic visual analysis of cinematic material, s
1 PAPER • NO BENCHMARKS YET
Placepedia contains 240K places with 35M images from all over the world. Each place is associated with its district, city/town/village, state/province, country, continent, and a large amount of diverse photos. Both administrative areas and places have rich side information, e.g. discription, population, category, function. In addition, two cleaned subsets (Places-Coarse and Places-Fine) for experiments are provided.
Trailers12k is a movie trailer dataset comprised of 12,000 titles associated to ten genres. It distinguishes from other datasets by its collection procedure aimed at providing a high-quality publicly available dataset.
WildQA is a video understanding dataset of videos recorded in outside settings. The dataset can be used to evaluate models for video question answering.
1 PAPER • 1 BENCHMARK