OmniObject3D is a large-vocabulary 3D object dataset containing a massive number of high-quality, real-scanned 3D objects.
31 PAPERS • NO BENCHMARKS YET
ACID consists of thousands of aerial drone videos of coastline and nature scenes gathered from YouTube, with camera poses recovered via structure-from-motion.
16 PAPERS • 2 BENCHMARKS
RealEstate10K is a large dataset of camera poses corresponding to 10 million frames derived from about 80,000 video clips, gathered from about 10,000 YouTube videos. For each clip, the poses form a trajectory where each pose specifies the camera position and orientation along the trajectory. These poses are derived by running SLAM and bundle adjustment algorithms on a large set of videos.
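To make the trajectory description above concrete, here is a minimal sketch of parsing one camera line in a RealEstate10K-style text file. The exact layout is an assumption based on the public release: each line is assumed to hold a frame timestamp, four normalized pinhole intrinsics, two unused zeros, and a row-major 3x4 world-to-camera extrinsic matrix; the function name `parse_pose_line` and the sample line are hypothetical.

```python
import numpy as np

def parse_pose_line(line: str):
    """Parse one assumed RealEstate10K-style camera line into
    (timestamp, intrinsics, 3x4 extrinsic)."""
    vals = [float(v) for v in line.split()]
    timestamp = int(vals[0])              # frame timestamp
    fx, fy, cx, cy = vals[1:5]            # normalized pinhole intrinsics
    # Remaining 12 values: row-major [R | t] world-to-camera matrix.
    extrinsic = np.array(vals[7:19]).reshape(3, 4)
    return timestamp, (fx, fy, cx, cy), extrinsic

# Hypothetical sample line: identity rotation, zero translation.
sample = "1000 0.9 1.6 0.5 0.5 0 0 1 0 0 0 0 1 0 0 0 0 1 0"
t, intrinsics, E = parse_pose_line(sample)
print(E.shape)  # (3, 4)
```

The pose's rotation and translation give the camera orientation and position for that frame; chaining the per-frame extrinsics reconstructs the clip's trajectory.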
9 PAPERS • 2 BENCHMARKS
SWORD contains around 1,500 training videos and 290 test videos, with 50 frames per video on average. The dataset was obtained by processing manually captured video sequences of static real-life urban scenes. Its main property is the abundance of close objects and, consequently, a larger prevalence of occlusions: according to the introduced heuristic, the mean area of occluded image parts in SWORD is approximately five times larger than in RealEstate10K (14% vs. 3%, respectively). This motivates the collection and use of SWORD and explains why it enables training more powerful models despite its smaller size.
4 PAPERS • 1 BENCHMARK
Replay is a collection of multi-view, multi-modal videos of humans interacting socially. Each scene is filmed in high production quality from different viewpoints with several static cameras as well as wearable action cameras, and recorded with a large array of microphones at different positions in the room. The full Replay dataset consists of 68 scenes of social interactions between people, such as playing board games, exercising, or unwrapping presents. Each scene is about 5 minutes long and filmed with 12 cameras, static and dynamic. Audio is captured separately by 12 binaural microphones and additional near-range microphones for each actor and for each egocentric video. All sensors are temporally synchronized, undistorted, geometrically calibrated, and color calibrated.
1 PAPER • NO BENCHMARKS YET
A synthetic dataset comprising three different environments for multi-camera dynamic novel view synthesis of soccer scenes. The dataset is compatible with Nerfstudio and includes data parsers with various settings to reproduce the experiments of the paper "Dynamic NeRFs for Soccer Scenes" and more.