To create the TED-talks dataset, 3,035 YouTube videos were downloaded using the "TED talks" query. From these initial candidates, videos in which the upper part of the person is visible for at least 64 frames and the height of the person bounding box is at least 384 pixels were selected. Static videos and videos in which a person is doing something other than presenting were manually filtered out.
12 PAPERS • 1 BENCHMARK
The dataset comprises 25 short sequences showing various objects in challenging backgrounds. Eight sequences are from the VOT2013 challenge (bolt, bicycle, david, diving, gymnastics, hand, sunshade, woman). The new sequences show complementary objects and backgrounds, for example a fish underwater or a surfer riding a big wave. The sequences were chosen from a large pool using a methodology based on clustering visual features of the object and background, so that the 25 sequences sample the existing pool evenly.
The Web Stereo Video Dataset consists of 553 stereoscopic videos from YouTube. This dataset has a wide variety of scene types, and features many nonrigid objects.
12 PAPERS • NO BENCHMARKS YET
WanJuan is a large-scale training corpus that includes multiple modalities. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB.
BL30K is a synthetic dataset rendered using Blender with ShapeNet's data. We break the dataset into six segments, each with approximately 5K videos. The videos are organized in a similar format as DAVIS and YouTubeVOS, so dataloaders for those datasets can be used directly. Each video is 160 frames long, and each frame has a resolution of 768*512. There are 3-5 objects per video, and each object has a random smooth trajectory -- we tried to optimize the trajectories in a greedy fashion to minimize object intersection (not guaranteed), with occlusions still possible (happen a lot in reality). See MiVOS for details.
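Since BL30K follows the DAVIS/YouTubeVOS folder layout, a minimal sketch of iterating over one video in that style is shown below; the directory names (JPEGImages, Annotations) and file extensions follow the DAVIS convention and are assumptions that should be checked against the actual MiVOS/BL30K release.

```python
import os
from PIL import Image

def iter_video(root, video_id):
    """Yield (frame, mask) pairs for one BL30K/DAVIS-style video.

    Assumes the DAVIS-style layout root/JPEGImages/<video_id>/* and
    root/Annotations/<video_id>/*.png; verify against the actual release.
    """
    frame_dir = os.path.join(root, "JPEGImages", video_id)
    mask_dir = os.path.join(root, "Annotations", video_id)
    for name in sorted(os.listdir(frame_dir)):
        frame = Image.open(os.path.join(frame_dir, name)).convert("RGB")
        mask_path = os.path.join(mask_dir, os.path.splitext(name)[0] + ".png")
        mask = Image.open(mask_path) if os.path.exists(mask_path) else None
        yield frame, mask
```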
11 PAPERS • NO BENCHMARKS YET
The dataset contains 21 full-HD videos, each around 1 hour long, captured at six different locations. Vehicles in the videos (20,865 instances in total) are annotated with precise speed measurements from optical gates using LiDAR and verified with several reference GPS tracks. The dataset is available for download and contains the videos and metadata (calibration, lengths of features in the image, annotations, and so on) for future comparison and evaluation.
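For intuition, the basic relation behind section-speed measurement is simply distance over time; the sketch below is only this textbook formula applied to assumed inputs, not the dataset's LiDAR-gate ground-truth pipeline.

```python
def average_speed_kmh(section_length_m: float, t_enter_s: float, t_exit_s: float) -> float:
    """Average speed over a measured road section, in km/h.

    Only the textbook relation speed = distance / time; the dataset's actual
    ground truth comes from LiDAR optical gates, not from this formula.
    """
    dt = t_exit_s - t_enter_s
    if dt <= 0:
        raise ValueError("exit time must be after enter time")
    return section_length_m / dt * 3.6  # m/s -> km/h
```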
11 PAPERS • 1 BENCHMARK
Consists of the key scenes from over 3K movies: each key scene is accompanied by a high-level semantic description of the scene, character face-tracks, and metadata about the movie. The dataset is scalable, obtained automatically from YouTube, and freely available for anybody to download and use.
The Multimodal Corpus of Sentiment Intensity (CMU-MOSI) dataset is a collection of 2199 opinion video clips. Each opinion video is annotated with sentiment in the range [-3,3]. The dataset is rigorously annotated with labels for subjectivity, sentiment intensity, per-frame and per-opinion annotated visual features, and per-milliseconds annotated audio features.
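Papers often collapse the continuous CMU-MOSI scores into discrete labels for evaluation; the sketch below shows one common binarization at zero, which is an assumption about the protocol rather than an official definition.

```python
def mosi_binary_label(score: float) -> int:
    """Map a CMU-MOSI sentiment score in [-3, 3] to a binary label.

    Thresholding at zero (non-negative -> positive) is one common convention;
    papers differ in how they treat exactly-zero scores, so this is an
    assumption rather than the official protocol.
    """
    return 1 if score >= 0 else 0
```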
11 PAPERS • 2 BENCHMARKS
A large-scale video dataset, featuring clips from movies with detailed captions.
EVE (End-to-end Video-based Eye-tracking) is a dataset for eye-tracking. It is collected from 54 participants and consists of 4 camera views, over 12 million frames and 1327 unique visual stimuli (images, video, text), adding up to approximately 105 hours of video data in total.
The human-related version of the ShanghaiTech Campus dataset was first presented by Morais et al. in the paper "Learning Regularity in Skeleton Trajectories for Anomaly Detection in Videos".
The HyperKvasir dataset contains 110,079 images and 374 videos capturing anatomical landmarks as well as pathological and normal findings, amounting to around 1 million images and video frames in total.
A video dataset for benchmarking upsampling methods. Inter4K contains 1,000 ultra-high-resolution videos at 60 frames per second (fps) collected from online resources. The dataset provides standardized video resolutions at ultra-high definition (UHD/4K), quad high definition (QHD/2K), full high definition (FHD/1080p), (standard) high definition (HD/720p), one quarter of full HD (qHD/540p), and one ninth of full HD (nHD/360p), with frame rates of 60, 50, 30, 24, and 15 fps for each resolution. Based on this standardization, both super-resolution and frame-interpolation tests can be performed at different scaling factors ($\times 2$, $\times 3$ and $\times 4$). Inter4K provides standardized UHD resolution and 60 fps for all 1,000 videos, each 5 seconds long, covering a diverse set of scenes. Differences between scenes originate from the recording equipment (e.g., professional 4K cameras or phones), lighting conditions, and other sources of variation.
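As a quick sanity check of how the standardized resolutions relate to the ×2/×3/×4 scaling factors, the following illustrative snippet prints the spatial sizes implied by downscaling UHD; the resolution table is standard video terminology, not code shipped with Inter4K.

```python
# Illustrative arithmetic only: spatial sizes implied by x2/x3/x4 downscaling
# of UHD/4K, compared against the standard named resolutions.
UHD = (3840, 2160)
NAMED = {"FHD/1080p": (1920, 1080), "HD/720p": (1280, 720), "qHD/540p": (960, 540)}

for factor in (2, 3, 4):
    w, h = UHD[0] // factor, UHD[1] // factor
    match = [name for name, size in NAMED.items() if size == (w, h)]
    print(f"x{factor}: {w}x{h}", "=", match[0] if match else "(no standard name)")
```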
The MedVidQA dataset contains a collection of 3,010 manually created health-related questions, with timestamps serving as visual answers to those questions, drawn from trusted video sources such as accredited medical schools with an established reputation, health institutes, health education, and medical practitioners.
A large-scale dataset for retrieval and event localisation in video. A unique feature of the dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content.
TITAN consists of 700 labeled video clips (with odometry) captured from a moving vehicle in highly interactive urban traffic scenes in Tokyo. The dataset includes 50 labels covering vehicle states and actions, pedestrian age groups, and targeted pedestrian action attributes that are organized hierarchically into atomic, simple/complex-contextual, transportive, and communicative actions.
TRIPOD contains screenplays and plot synopses with turning point (TP) annotations for 99 movies.
The ViTT dataset consists of human-produced segment-level annotations for 8,169 videos. Of these, 5,840 videos have been annotated once, and the rest have been annotated twice or more, for a total of 12,461 released sets of annotations. The videos in the dataset come from the YouTube-8M dataset.
The Video2GIF dataset contains over 100,000 pairs of GIFs and their source videos. The GIFs were collected from two popular GIF websites (makeagif.com, gifsoup.com) and the corresponding source videos were collected from YouTube in Summer 2015. IDs and URLs of the GIFs and the videos are provided, along with temporal alignment of GIF segments to their source videos. The dataset shall be used to evaluate GIF creation and video highlight techniques.
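Because only IDs, URLs, and temporal alignments are released, a user typically represents each GIF-video pair as a small record like the hypothetical one below; the field names are illustrative and not the released annotation schema.

```python
from dataclasses import dataclass

@dataclass
class GifAlignment:
    """One GIF-to-source-video alignment record.

    Field names are hypothetical; consult the released ID/URL/alignment files
    for the exact schema.
    """
    gif_id: str
    youtube_id: str
    start_sec: float  # start of the aligned segment in the source video
    end_sec: float    # end of the aligned segment

    @property
    def duration(self) -> float:
        return self.end_sec - self.start_sec
```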
VideoSet is a large-scale compressed video quality dataset based on just-noticeable-difference (JND) measurement.
YT-UGC is a large-scale UGC (User-Generated Content) dataset of 1,500 20-second video clips sampled from millions of YouTube videos. The dataset covers popular categories like Gaming and Sports, and new features like High Dynamic Range (HDR). It can be used to study video compression and quality assessment.
AVSBench is a pixel-level audio-visual segmentation benchmark that provides ground-truth labels for sounding objects. The dataset is divided into three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied, one for each subset.
10 PAPERS • NO BENCHMARKS YET
Car Crash Dataset (CCD) is collected for traffic accident analysis. It contains real traffic accident videos captured by dashcams mounted on driving vehicles, which is critical for developing safety-guaranteed self-driving systems. CCD is distinguished from existing datasets by its diversified accident annotations, including environmental attributes (day/night, snowy/rainy/good weather conditions), whether the ego-vehicle is involved, accident participants, and accident reason descriptions.
10 PAPERS • 1 BENCHMARK
Countix is a real-world dataset of repetition videos collected in the wild (i.e., from YouTube), covering a wide range of semantic settings with significant challenges such as camera and object motion, a diverse set of periods and counts, and changes in the speed of repeated actions. Countix includes videos of repeated workout activities (squats, pull ups, battle rope training, exercising arm), dance moves (pirouetting, pumping fist), playing instruments (playing ukulele), using tools repeatedly (hammer hitting objects, chainsaw cutting wood, slicing onion), artistic performances (hula hooping, juggling soccer ball), sports (playing ping pong and tennis), and many others. Examples from the dataset, along with the distribution of repetition counts and period lengths, are illustrated in the original paper.
The Deception Detection and Physiological Monitoring (DDPM) dataset captures an interview scenario in which the interviewee attempts to deceive the interviewer on selected responses. The interviewee is recorded in RGB, near-infrared, and long-wave infrared, along with cardiac pulse, blood oxygenation, and audio. After collection, data were annotated for interviewer/interviewee, curated, ground-truthed, and organized into train/test parts for a set of canonical deception detection experiments. The dataset contains almost 13 hours of recordings of 70 subjects, and over 8 million visible-light, near-infrared, and thermal video frames, along with appropriate meta, audio, and pulse oximeter data.
DramaQA focuses on two perspectives: 1) hierarchical QAs as an evaluation metric based on the cognitive developmental stages of human intelligence, and 2) character-centered video annotations to model the local coherence of the story. The dataset is built upon the TV drama "Another Miss Oh" and contains 17,983 QA pairs from 23,928 video clips of various lengths, with each QA pair belonging to one of four difficulty levels.
10 PAPERS • 3 BENCHMARKS
The EgoDexter dataset provides both 2D and 3D pose annotations for four testing video sequences with 3,190 frames. The videos are recorded with a body-mounted camera from egocentric viewpoints and contain cluttered backgrounds, fast camera motion, and complex interactions with various objects. Fingertip positions were manually annotated for 1,485 of the 3,190 frames.
The YF-E6 emotion dataset was collected using the six basic emotion types as keywords on social video-sharing websites, including YouTube and Flickr, yielding a total of 3,000 videos. The dataset is labeled through crowdsourcing by 10 different annotators (5 males and 5 females) whose ages ranged from 22 to 45. Annotators were given a detailed definition of each emotion before performing the task, and every video is manually labeled by all annotators. A video is excluded from the final dataset when over half of its annotations are inconsistent with the initial search keyword.
The GTA Indoor Motion dataset (GTA-IM) emphasizes human-scene interactions in indoor environments. It consists of HD RGB-D image sequences of 3D human motion from a realistic game engine. The dataset has clean 3D human pose and camera pose annotations, and large diversity in human appearances, indoor environments, camera views, and human activities.
10 PAPERS • 2 BENCHMARKS
The Hands in Action (HIC) dataset contains RGB-D sequences of hands interacting with objects.
The human-related version of the CUHK Avenue dataset was first presented by Morais et al. in the paper "Learning Regularity in Skeleton Trajectories for Anomaly Detection in Videos".
LVOS is a dataset for long-term video object segmentation (VOS). It consists of 220 videos with a total duration of 421 minutes. The videos in LVOS last 1.59 minutes on average, which is 20 times longer than videos in existing VOS datasets. Each video includes various attributes, especially challenges arising in the wild, such as long-term reappearing objects and cross-temporal similar objects.
PANDA is the first gigaPixel-level humAN-centric viDeo dAtaset, built for large-scale, long-term, and multi-object visual analysis. The videos in PANDA were captured by a gigapixel camera and cover real-world scenes with both a wide field of view (~1 square kilometer area) and high-resolution details (~gigapixel level per frame). A scene may contain around 4k head counts with over 100× scale variation. PANDA provides enriched and hierarchical ground-truth annotations, including 15,974.6k bounding boxes, 111.8k fine-grained attribute labels, 12.7k trajectories, 2.2k groups, and 2.9k interactions.
The REalistic and Dynamic Scenes (REDS) dataset was proposed in the NTIRE 2019 Challenge. It is composed of 300 video sequences with a resolution of 720×1,280, and each video has 100 frames; the training, validation, and testing sets contain 240, 30, and 30 videos, respectively.
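A tiny bookkeeping sketch of the split sizes follows; it only reproduces the counts stated above and is not the official file list.

```python
# Bookkeeping for the REDS split sizes stated above (not the official file lists).
train, val, test = 240, 30, 30
frames_per_video = 100
assert train + val + test == 300
print("total frames:", (train + val + test) * frames_per_video)  # 30,000
```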
Co-speech gestures are everywhere: people make gestures when they chat with others, give a public speech, talk on the phone, and even think aloud. Despite this ubiquity, not many datasets are available, mainly because it is expensive to recruit actors/actresses and track precise body motions. A few datasets exist (e.g., MSP AVATAR [17] and Personality Dyads Corpus [18]), but their sizes are limited to less than 3 hours, and they lack diversity in speech content and speakers. The gestures can also be unnatural owing to inconvenient body-tracking suits and acting in a lab environment.
Toyota Smarthome Untrimmed (TSU) is a dataset for activity detection in long untrimmed videos. The dataset contains 536 videos with an average duration of 21 minutes. Since it is based on the same footage as the Toyota Smarthome Trimmed version, it features the same challenges and introduces additional ones. The dataset is annotated with 51 activities.
The USF Human ID Gait Challenge Dataset is a dataset of videos for gait recognition. It has videos from 122 subjects in up to 32 possible combinations of variations in factors.
We propose a new, scalable video-mining pipeline that transfers captioning supervision from image datasets to video and audio. We use this pipeline to mine paired video and captions, using the Conceptual Captions 3M image dataset as a seed dataset. Our resulting dataset, VideoCC3M, consists of millions of weakly paired clips with text captions and will be released publicly.
VideoMatte240K consists of 484 high-resolution green-screen videos, from which a total of 240,709 unique frames of alpha mattes and foregrounds were generated with the chroma-keying software Adobe After Effects. The videos were purchased as stock footage or found as royalty-free material online. 384 videos are at 4K resolution and 100 are in HD. The videos are split 479:5 to form the train and validation sets. The dataset contains a wide variety of human subjects, clothing, and poses that are helpful for training robust models.
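Matting datasets like this are typically consumed by compositing the released foregrounds and alpha mattes onto new backgrounds; the sketch below is the standard compositing equation I = αF + (1 − α)B, not code from the dataset authors.

```python
import numpy as np

def composite(fg: np.ndarray, alpha: np.ndarray, bg: np.ndarray) -> np.ndarray:
    """Standard alpha compositing: I = alpha * F + (1 - alpha) * B.

    fg, bg: float arrays of shape (H, W, 3) with values in [0, 1];
    alpha:  float array of shape (H, W, 1) with values in [0, 1].
    Any real training pipeline will add its own augmentations on top.
    """
    return alpha * fg + (1.0 - alpha) * bg
```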
BLVD is a large-scale 5D semantics dataset collected by the Visual Cognitive Computing and Intelligent Vehicles Lab. It contains 654 high-resolution video clips comprising 120k frames captured in Changshu, Jiangsu Province, China, where the Intelligent Vehicle Proving Center of China (IVPCC) is located. The frame rate is 10 fps for both RGB data and 3D point clouds. The dataset contains fully annotated frames, which yield 249,129 3D annotations, 4,902 independent individuals for tracking (with an overall trajectory length of 214,922 points), 6,004 valid fragments for 5D interactive event recognition, and 4,900 individuals for 5D intention prediction. These tasks are set in four kinds of scenarios depending on the object density (low and high) and light conditions (daytime and nighttime).
9 PAPERS • NO BENCHMARKS YET
The Blackbird unmanned aerial vehicle (UAV) dataset is a large-scale, aggressive indoor flight dataset collected using a custom-built quadrotor platform for use in evaluation of agile perception. The Blackbird dataset contains over 10 hours of flight data from 168 flights over 17 flight trajectories and 5 environments. Each flight includes sensor data from 120Hz stereo and downward-facing photorealistic virtual cameras, 100Hz IMU, motor speed sensors, and 360Hz millimeter-accurate motion capture ground truth. Camera images for each flight were photorealistically rendered using FlightGoggles across a variety of environments to facilitate easy experimentation of high performance perception algorithms.
Amazon Mechanical Turk (AMT) is used to collect annotations on HowTo100M videos. 30k 60-second clips are randomly sampled from 9,421 videos and presented to the turkers, who are asked to select a video segment containing a single, self-contained scene. After this segment-selection step, another group of workers is asked to write descriptions for each displayed segment. Narrations are not provided to the workers to ensure that their written queries are based on visual content only. The final video segments are 10-20 seconds long on average, and the length of queries ranges from 8 to 20 words. From this process, 51,390 queries are collected for 24k 60-second clips from 9,371 videos in HowTo100M, on average 2-3 queries per clip. The video clips and their associated queries are split into 80% train, 10% val, and 10% test.
The ImageNet-VidVRD dataset contains 1,000 videos selected from the ILSVRC2016-VID dataset based on whether the video contains clear visual relations. It is split into a training set of 800 videos and a test set of 200 videos, and covers common subjects/objects of 35 categories and predicates of 132 categories. Ten people contributed to labeling the dataset, which includes object trajectory labeling and relation labeling. Since the ILSVRC2016-VID dataset already provides object trajectory annotations for 30 categories, the annotations were supplemented by labeling the remaining 5 categories. To save relation-labeling labor, typical segments of the videos in the training set and the whole of the videos in the test set were labeled.
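A relation instance in this kind of annotation is essentially a (subject, predicate, object) triplet grounded by object trajectories over a frame span; the dataclass below is a hypothetical illustration of such a record, not the released JSON schema.

```python
from dataclasses import dataclass

@dataclass
class RelationInstance:
    """A hypothetical video visual-relation record (illustrative field names)."""
    subject_tid: int   # trajectory id of the subject
    object_tid: int    # trajectory id of the object
    predicate: str     # one of the 132 predicate categories
    begin_fid: int     # first frame index of the relation segment
    end_fid: int       # last frame index of the relation segment
```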
9 PAPERS • 3 BENCHMARKS
This is a dataset for the video inverse tone mapping task. It contains varied content for the task of restoring HDR video: fireworks, flowers, football, a night city, and scenes with reflections. The videos have different brightness ranges and contain different types of lighting. The camera used to shoot the dataset captures 14 stops of dynamic range.
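For reference, each stop doubles the representable luminance range, so 14 stops corresponds to a contrast ratio of roughly 2^14:1; the one-liner below just spells out that arithmetic.

```python
# Each stop doubles the luminance range, so 14 stops ~= a 2**14 : 1 contrast ratio.
stops = 14
print(f"{stops} stops -> contrast ratio of about {2 ** stops}:1")  # 16384:1
```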
9 PAPERS • 1 BENCHMARK
MUGEN is a large-scale video-audio-text dataset collected using the open-sourced platform game CoinRun. MUGEN can help progress research on many tasks in multimodal understanding and generation.
Large-scale American Sign Language (ASL) - English dataset collected from online video sites (e.g., YouTube). OpenASL contains 288 hours of ASL videos in multiple domains from over 200 signers.
OpenLane-V2 is the world's first perception and reasoning benchmark for scene structure in autonomous driving. The primary task of the dataset is scene structure perception and reasoning, which requires the model to recognize the dynamic drivable states of lanes in the surrounding environment. The challenge of this dataset includes not only detecting lane centerlines and traffic elements but also recognizing the attribute of traffic elements and topology relationships on detected objects.
The PATS dataset consists of a large and diverse amount of aligned pose, audio, and transcripts. With this dataset, we hope to provide a benchmark that helps develop technologies for virtual agents that generate natural and relevant gestures.