The largest and most realistic dataset available for TCC. It consists of 600 real-world videos recorded with a high-resolution mobile phone camera at 1824 x 1368 resolution. Video length ranges from 3 to 17 frames (7.3 on average; the median is 7.0 and the mode is 8.5). Ground-truth information is present only for the last frame of each video (i.e., the shot frame) and was collected using a gray-surface calibration target.
4 PAPERS • NO BENCHMARKS YET
The data we use include 366 monthly series, 427 quarterly series and 518 yearly series. They were supplied both by tourism bodies (such as Tourism Australia, the Hong Kong Tourism Board and Tourism New Zealand) and by various academics who had used them in previous tourism forecasting studies (please refer to the acknowledgements for details of the data sources and their availability).
This meta-dataset is composed of previously known datasets.
4 PAPERS • 1 BENCHMARK
Abstract: The task for this dataset is to forecast the spatio-temporal traffic volume based on the historical traffic volume and other features in neighboring locations.
This dataset is a collection of labelled PCAP files, both encrypted and unencrypted, across 10 applications, as well as a pandas dataframe in HDF5 format containing detailed metadata summarizing the connections from those files. It was created to assist the development of machine learning tools that would allow operators to see the traffic categories of both encrypted and unencrypted traffic flows. In particular, features of the network packet traffic timing and size information (both inside of and outside of the VPN) can be leveraged to predict the application category that generated the traffic.
The eICU Collaborative Research Database is a large multi-center critical care database made available by Philips Healthcare in partnership with the MIT Laboratory for Computational Physiology.
The eSports Sensors dataset contains sensor data collected from 10 players across 22 League of Legends matches.
4 PAPERS • 2 BENCHMARKS
The exiD dataset is a collection of naturalistic road-user trajectories at highway entries and exits in Germany, captured with drones to avoid limitations of conventional traffic data collection, such as occlusions. Computer-vision algorithms extract each road user's trajectory and type with very high positional accuracy, minimizing errors and making the dataset a reliable resource for research and development in automated driving.
Autism spectrum disorder (ASD) is characterized by qualitative impairment in social reciprocity, and by repetitive, restricted, and stereotyped behaviors/interests. Previously considered rare, ASD is now recognized to occur in more than 1% of children. Despite continuing research advances, their pace and clinical impact have not kept up with the urgency to identify ways of determining the diagnosis at earlier ages, selecting optimal treatments, and predicting outcomes. For the most part this is due to the complexity and heterogeneity of ASD. To face these challenges, large-scale samples are essential, but single laboratories cannot obtain sufficiently large datasets to reveal the brain mechanisms underlying ASD. In response, the Autism Brain Imaging Data Exchange (ABIDE) initiative has aggregated functional and structural brain imaging data collected from laboratories around the world to accelerate our understanding of the neural bases of autism.
3 PAPERS • NO BENCHMARKS YET
Context: A radio signal consists of two channels, channel I (for 'in phase') and channel Q (for 'quadrature'), and can be represented as a stream of complex numbers. It may convey information by encoding it as a sequence of symbols sampled from a finite set of complex numbers called a "modulation". Several standard modulations exist, for example (non-exhaustive list): BPSK, QAM, QPSK of order N, PSK of order N…
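As an illustration of the I/Q representation described above, the following sketch (not part of the dataset) maps random bits onto QPSK symbols, i.e. complex numbers whose real and imaginary parts are the I and Q channels:

```python
import numpy as np

# Illustrative sketch only: map random bits onto QPSK symbols, a stream
# of complex numbers (real part = channel I, imaginary part = channel Q).
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=16)      # raw bit stream
pairs = bits.reshape(-1, 2)             # 2 bits per QPSK symbol

# QPSK constellation: 4 points on the unit circle.
constellation = np.array([1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]) / np.sqrt(2)
symbols = constellation[pairs[:, 0] * 2 + pairs[:, 1]]
print(symbols[:4])                      # stream of complex numbers
```

The same indexing idea extends to higher-order modulations by enlarging the constellation array.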
Three-dimensional position of external markers placed on the chest and abdomen of healthy individuals breathing during intervals from 73s to 222s. The markers move because of the respiratory motion, and their position is sampled at approximately 10Hz. Markers are metallic objects used during external beam radiotherapy to track and predict the motion of tumors due to breathing for accurate dose delivery.
3 PAPERS • 1 BENCHMARK
The Household Object Movements from Everyday Routines (HOMER) dataset is composed of routine behaviors for five households, spanning 50 days for the train split and 10 days for the test split. The households are based on an identical apartment setting with four rooms, 108 objects, and 33 atomic actions such as find, grab, etc.
The original dataset from Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting contains traffic readings collected from 207 loop detectors on highways in Los Angeles County, aggregated into 5-minute intervals over four months between March 2012 and June 2012.
This database includes 25 long-term ECG recordings of human subjects with atrial fibrillation (mostly paroxysmal).
IMU and WiFi data, along with aligned visual SLAM ground-truth locations, from a smartphone carried during natural human motion.
The original dataset from Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting contains 6 months of traffic readings from 01/01/2017 to 05/31/2017, collected every 5 minutes by 325 traffic sensors in the San Francisco Bay Area. The measurements are provided by the California Transportation Agencies (CalTrans) Performance Measurement System (PeMS).
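As a hypothetical sketch of working with such 5-minute readings, the snippet below aggregates synthetic sensor values into hourly means with pandas; the sensor name and values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical 5-minute readings for a single sensor over one day
# (288 intervals); sensor id and values are invented for illustration.
idx = pd.date_range("2017-01-01", periods=288, freq="5min")
readings = pd.DataFrame(
    {"sensor_400001": np.random.default_rng(1).uniform(40, 70, size=288)},
    index=idx,
)

# Downsample 5-minute readings to hourly means.
hourly = readings.resample("1h").mean()
print(hourly.shape)  # (24, 1)
```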
Overview: This database of simulated arterial pulse waves is designed to be representative of pulse waves measured from healthy adults. It contains pulse waves for 4,374 virtual subjects aged 25 to 75 years (in 10-year increments). The database contains a baseline set of pulse waves for each of the six age groups, created using cardiovascular properties (such as heart rate and arterial stiffness) representative of healthy subjects in each age group. It also contains 728 further virtual subjects per age group, in which each of the cardiovascular properties is varied within normal ranges. This allows for extensive in silico analyses of haemodynamics and of the performance of pulse wave analysis algorithms.
SKAB is designed for evaluating algorithms for anomaly detection. The benchmark currently includes 30+ datasets plus Python modules for algorithms’ evaluation. Each dataset represents a multivariate time series collected from the sensors installed on the testbed. All instances are labeled for evaluating the results of solving outlier detection and changepoint detection problems.
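A minimal baseline of the kind SKAB is meant to evaluate can be sketched as a simple 3-sigma outlier detector on synthetic data; this is an illustration only, and the benchmark's own Python modules should be used for real evaluation:

```python
import numpy as np

# Toy outlier detection: flag points more than 3 standard deviations
# from the mean of a synthetic univariate signal with one injected spike.
rng = np.random.default_rng(42)
signal = rng.normal(0.0, 1.0, size=1000)
signal[500] = 8.0  # injected anomaly

z = np.abs((signal - signal.mean()) / signal.std())
outliers = np.flatnonzero(z > 3)
print(outliers)  # indices flagged as anomalous, including 500
```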
3 PAPERS • 2 BENCHMARKS
The dataset is approved for public release, distribution unlimited.
This dataset contains aircraft trajectories in an untowered terminal airspace, collected over 8 months around the Pittsburgh-Butler Regional Airport [ICAO: KBTP], a single-runway GA airport 10 miles north of Pittsburgh, Pennsylvania. Trajectories are captured with an on-site Automatic Dependent Surveillance-Broadcast (ADS-B) receiver placed within the airport premises, listening on both the 1090 MHz and 978 MHz frequencies. ADS-B uses satellite navigation to produce accurate locations and timestamps for the targets, which are recorded on-site using our custom setup. The data span 18 Sept 2020 to 23 Apr 2021 and include a total of 111 days, discounting downtime, repairs, and bad-weather days with no traffic; collection runs from 1:00 AM to 11:00 PM local time each day.
Visuelle 2.0 is a dataset containing real data for 5,355 clothing products of the Italian fast-fashion retailer Nuna Lie. Specifically, Visuelle 2.0 provides data from 6 fashion seasons (partitioned into Autumn-Winter and Spring-Summer) from 2017-2019, right before the Covid-19 pandemic. Each product is accompanied by an HD image, textual tags, and more. The time series data are disaggregated at the shop level and include sales, inventory stock, max-normalized prices (for the sake of confidentiality), and discounts. Exogenous time series data are also provided, in the form of Google Trends based on the textual tags and multivariate weather conditions at the stores' locations. Finally, we also provide purchase data for 667K customers, whose identities have been anonymized, to capture personal preferences. With these data, Visuelle 2.0 makes it possible to address several problems which characterize the activity of a fast-fashion company, such as new product demand forecasting.
WEAR is an outdoor sports dataset for both vision- and inertial-based human activity recognition (HAR). The dataset comprises data from 18 participants performing a total of 18 different workout activities with untrimmed inertial (acceleration) and camera (egocentric video) data recorded at 10 different outside locations. Unlike previous egocentric datasets, WEAR provides a challenging prediction scenario marked by purposely introduced activity variations as well as an overall small information overlap across modalities.
voraus-AD contains machine data of a collaborative robot, which moves a can by performing an industrial pick-and-place task. The samples consist of time series of machine data, each recorded over one pick-and-place operation. As usual in anomaly detection, the training set contains only normal data, i.e., regular samples without anomalies. The test set contains both normal data and anomalies, covering 12 diverse anomaly types. To create a realistic scenario, we divided the normal data into training and test data as follows: up to a certain point in time, only training data (948 samples) was recorded. Subsequently, recordings of anomalies (755 samples) and normal data (419 samples) for the test set were taken alternately. This simulates a real application where training data would be recorded first in the same way to train the model before the test case occurs. To exclude temperature effects, we let the robots warm up for half an hour before each recording.
The data consist of 70 records, divided into a learning set of 35 records (a01 through a20, b01 through b05, and c01 through c10), and a test set of 35 records (x01 through x35), all of which may be downloaded from this page. Recordings vary in length from slightly less than 7 hours to nearly 10 hours each. Each recording includes a continuous digitized ECG signal, a set of apnea annotations (derived by human experts on the basis of simultaneously recorded respiration and related signals), and a set of machine-generated QRS annotations (in which all beats regardless of type have been labeled normal). In addition, eight recordings (a01 through a04, b01, and c01 through c03) are accompanied by four additional signals (Resp C and Resp A, chest and abdominal respiratory effort signals obtained using inductance plethysmography; Resp N, oronasal airflow measured using nasal thermistors; and SpO2, oxygen saturation).
2 PAPERS • 1 BENCHMARK
The BRUSH dataset (BRown University Stylus Handwriting) contains 27,649 online handwriting samples from a total of 170 writers. Every sequence is labeled with intended characters such that dataset users can identify to which character a point in a sequence corresponds. The dataset was introduced in the paper "Generating Handwriting via Decoupled Style Descriptors" by Atsunobu Kotani, Stefanie Tellex, James Tompkin from Brown University, presented at European Conference on Computer Vision (ECCV) 2020.
2 PAPERS • NO BENCHMARKS YET
The temporal variability of the calving front positions of marine-terminating glaciers permits inference of frontal ablation. Frontal ablation, the sum of the calving rate and the melt rate at the terminus, contributes significantly to the mass balance of glaciers. Accordingly, glacier area has been declared an Essential Climate Variable product by the World Meteorological Organization. The presented dataset provides the information necessary for training deep learning techniques to automate the process of calving front delineation. The dataset includes Synthetic Aperture Radar (SAR) images of seven glaciers distributed around the globe. Five of them are located in Antarctica: Crane, Dinsmoore-Bombardier-Edgeworth, Mapple, Jorum, and the Sjörgen-Inlet Glacier. The remaining glaciers are the Jakobshavn Isbrae Glacier in Greenland and the Columbia Glacier in Alaska. Several images were taken of each glacier, forming time series that begin in 1995.
2 PAPERS • 2 BENCHMARKS
Climate models are critical tools for analyzing climate change and projecting its future impact. The machine learning (ML) community has taken an increased interest in supporting climate scientists’ efforts on various tasks such as climate model emulation, downscaling, and prediction tasks. However, traditional datasets based on single climate models are limiting. We thus present ClimateSet — a comprehensive collection of inputs and outputs from 36 climate models sourced from the Input4MIPs and CMIP6 archives, designed for large-scale ML applications.
This experiment was performed to empirically measure the energy use of small electric Unmanned Aerial Vehicles (UAVs). We autonomously direct a DJI® Matrice 100 (M100) drone to take off, carry a range of payload weights on a triangular flight pattern, and land. Between flights, we varied specified parameters across a set of discrete options: payloads of 0 g, 250 g, and 500 g; cruise altitudes of 25 m, 50 m, 75 m, and 100 m; and cruise speeds of 4 m/s, 6 m/s, 8 m/s, 10 m/s, and 12 m/s.
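The parameter grid above yields 3 × 4 × 5 = 60 distinct flight configurations, which can be enumerated as follows (a sketch; variable names are our own):

```python
from itertools import product

# Discrete options from the experiment description.
payloads_g = [0, 250, 500]
altitudes_m = [25, 50, 75, 100]
speeds_ms = [4, 6, 8, 10, 12]

# Cartesian product of all options: 3 * 4 * 5 = 60 configurations.
configs = list(product(payloads_g, altitudes_m, speeds_ms))
print(len(configs))  # 60
```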
The dataset contains historical technical data from the Dhaka Stock Exchange (DSE). The data were collected from publicly available sources on the internet and are provided for information and research purposes; although to the best of our knowledge they contain no errors, some mistakes may remain. Using this dataset for portfolio-management purposes is not encouraged; use it at your own risk. The contributors accept no liability for any use of the data.
Fusion-DHL is a multimodal sensor dataset with ground-truth positions.
HiRID is a freely accessible critical care dataset containing data relating to almost 34 thousand patient admissions to the Department of Intensive Care Medicine of the Bern University Hospital, Switzerland (ICU), an interdisciplinary 60-bed unit admitting >6,500 patients per year. The ICU offers the full range of modern interdisciplinary intensive care medicine for adult patients. The dataset was developed in cooperation between the Swiss Federal Institute of Technology (ETH) Zürich, Switzerland and the ICU.
2 PAPERS • 6 BENCHMARKS
The dataset covers hotel demand and revenue for 8 major tourist destinations in the US (e.g., Los Angeles, Orlando, ...), including sales, daily occupancy, demand, and revenue for upper-middle-class hotels.
Hurricane is a new spatio-temporal benchmark dataset suited for forecasting during extreme events and anomalies. It is built from Florida Department of Revenue data on monthly sales revenue (2003-2020) for the tourism industry in all 67 counties of Florida, which are prone to annual hurricanes. Furthermore, we aligned and joined the raw time series with the history of hurricane categories over time for each county. More precisely, the hurricane category indicates the maximum sustained wind speed, which can result in catastrophic damage (Oceanic 2022).
This dataset presents a set of large-scale ridesharing Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.
The Lorenz dataset contains 100,000 time series of length 24. The data has 5 modes, obtained by integrating the Lorenz equations with 5 different seed values.
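A single Lorenz trajectory of length 24 can be generated as in the sketch below; the classic parameter values and the sampling window are assumptions, since the dataset's exact settings are not documented here:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Lorenz system with the classic parameters (assumed, not taken from the
# dataset documentation): sigma=10, rho=28, beta=8/3.
def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

# Sample 24 points along one trajectory, matching the series length above.
t_eval = np.linspace(0.0, 2.3, 24)
sol = solve_ivp(lorenz, (0.0, 2.3), y0=[1.0, 1.0, 1.0], t_eval=t_eval)
series = sol.y.T  # shape (24, 3): one multivariate time series
```

Varying the initial condition `y0` (e.g. via different random seeds) would produce the distinct modes mentioned above.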
We provide a dataset called MMAC Captions for sensor-augmented egocentric-video captioning. The dataset contains 5,002 activity descriptions, extending the CMU-MMAC dataset. A number of activity-description examples can be found on the homepage.
The MTHS dataset contains 30 Hz PPG signals obtained from 62 patients (35 men and 27 women). The ground-truth data include heart rate and oxygen saturation (SpO2) sampled at 1 Hz, measured with a pulse oximeter (M70). An iPhone 5s was used to record the PPG at 30 fps.
Texture-based studies and designs have recently been a focus of research, yet whisker-based multidimensional surface-texture data is missing from the literature. Such data is critical for robotics and machine-perception algorithms in the classification and regression of textured surfaces. We present a novel sensor design to acquire multidimensional texture information. The roughness and hardness of surface textures were measured experimentally using sweeping and dabbing motions. The data is made available to the research community to further advance texture-perception studies.
PDEBench provides a diverse and comprehensive set of benchmarks for scientific machine learning, including challenging and realistic physical problems. The repository consists of the code used to generate the datasets, to upload and download them from the data repository, and to train and evaluate different machine learning models as baselines. PDEBench features a much wider range of PDEs than existing benchmarks and includes realistic and difficult problems (both forward and inverse), with larger ready-to-use datasets comprising various initial conditions, boundary conditions, and PDE parameters. Moreover, PDEBench was created with extensible source code, and we invite active participation to improve and extend the benchmark.
PPG-DaLiA is a publicly available dataset for PPG-based heart rate estimation. This multimodal dataset features physiological and motion data, recorded from both a wrist- and a chest-worn device, of 15 subjects while performing a wide range of activities under close to real-life conditions. The included ECG data provides heart rate ground truth. The included PPG- and 3D-accelerometer data can be used for heart rate estimation, while compensating for motion artefacts.
Automated leaf segmentation is a challenging area in computer vision. Recent advances in machine learning have achieved better results than traditional image-processing techniques; however, training such systems often requires large annotated data sets. To contribute annotated data and help overcome this bottleneck in plant-phenotyping research, we provide a novel photometric stereo (PS) data set with annotated leaf masks. This data set forms part of the work done in the BBSRC Tools and Resources Development project BB/N02334X/1.
The Rainforest Automation Energy (RAE) dataset was created to help smart-grid researchers test algorithms that make use of smart meter data. This initial release of RAE contains 1 Hz data (mains and sub-meters) from two residential houses. In addition to power data, environmental and sensor data from each house's thermostat are included. Sub-meter data from one of the houses includes heat pump and rental suite captures, which are of interest to power utilities.
RSDD-Time is a dataset of 598 manually annotated self-reported depression diagnosis posts from Reddit that include temporal information about the diagnosis. Annotations include whether a mental health condition is present and how recently the diagnosis happened. Additionally, the dataset includes exact temporal spans that relate to the date of diagnosis.
Technical Information: Dates range from 2017-09-11 to 2018-02-16 and the time interval is 1 minute. This is a MultiIndex CSV file; to load it in pandas use:
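The original loading snippet is not preserved on this page. A typical pattern for a CSV with two header rows (MultiIndex columns) and a datetime index looks like the sketch below; the column layout is assumed purely for illustration:

```python
import io
import pandas as pd

# Stand-in for the real file: two header rows (asset, field) plus a
# datetime index column. The asset/field names here are invented.
csv_text = (
    "datetime,BTC,BTC,ETH,ETH\n"
    ",open,close,open,close\n"
    "2017-09-11 00:00:00,4200.0,4201.0,295.0,295.5\n"
)

# header=[0, 1] builds the MultiIndex columns; index_col=0 plus
# parse_dates=True turns the first column into a DatetimeIndex.
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1],
                 index_col=0, parse_dates=True)
print(df.columns.nlevels)  # 2
```

For the real dataset, replace `io.StringIO(csv_text)` with the path to the CSV file.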
The softwarised network data zoo (SNDZoo) is an open collection of software networking data sets aiming to streamline and ease machine learning research in the software networking domain. Most of the published data sets focus on, but are not limited to, the performance of virtualised network functions (VNFs). The data is collected using fully automated NFV benchmarking frameworks, such as tng-bench, developed by us or third party solutions like Gym. The collection of the presented data sets follows the general VNF benchmarking methodology described in.
Forecast Sales using ARIMA and SARIMA
NREL's Solar Power Data for Integration Studies are synthetic solar photovoltaic (PV) power plant data points for the United States representing the year 2006.
Electrophysiological data from implanted electrodes in the human brain are rare, and scientific access to them has therefore remained somewhat exclusive. Here we present a freely available curated library of implanted electrocorticographic (ECoG) data and analyses for 16 benchmark behavioral experiments, with 204 individual datasets from 34 patients recorded with the same amplifiers (at the same sampling rate and filter settings). In every case, electrode positions have been carefully registered to brain anatomy. A large set of fully commented analysis scripts to interpret these data using modern techniques is embedded in the library alongside the data. All data, anatomic correlations, and analysis files (MATLAB code) are in a common, intuitive file structure at https://searchworks.stanford.edu/view/zk881ps0522. The library may be used as course material or serve as a starter package for researchers early in their career, or for established groups to modify the analyses and re-apply them.