GIRT-Data is the first and largest dataset of issue report templates (IRTs) in both YAML and Markdown format. This dataset and its corresponding open-source crawler tool are intended to support research in this area and to encourage more developers to use IRTs in their repositories. The stable version of the dataset contains 1_084_300 repositories, and 50_032 of them support IRTs.
2 PAPERS • NO BENCHMARKS YET
The dataset contains the hotel demand and revenue of 8 major tourist destinations in the US (e.g., Los Angeles, Orlando ...). The dataset contains sales, daily occupancy, demand, and revenue of the upper-middle class hotels.
Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can highly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as assigned classes to each entry annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of expe
A new spatio-temporal benchmark dataset (Hurricane), is suited for forecasting during extreme events and anomalies. The dataset is provided through the Florida Department of Revenue which provides the monthly sales revenue (2003-2020) for the tourism industry for all 67 counties of Florida which are prone to annual hurricanes. Furthermore, we aligned and joined the raw time series with the history of hurricane categories based on time for each county. More precisely, the hurricane category indicates the maximum sustained wind speed which can result in catastrophic damages (Oceanic 2022).
2 PAPERS • 1 BENCHMARK
IHDS is a nationally representative, multi-topic panel survey of 41,554 households in 1503 villages and 971 urban neighborhoods across India.
Click to add a brief description of the dataset (Markdown and LaTeX enabled).
This dataset presents a set of large-scale ridesharing Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.
The original dataset was provided by Orange telecom in France, which contains anonymized and aggregated human mobility data. The Multivariate-Mobility-Paris dataset comprises information from 2020-08-24 to 2020-11-04 (72 days during the COVID-19 pandemic), with time granularity of 30 minutes and spatial granularity of 6 coarse regions in Paris, France. In other words, it represents a multivariate time series dataset.
The Musk dataset describes a set of molecules, and the objective is to detect musks from non-musks. This dataset describes a set of 92 molecules of which 47 are judged by human experts to be musks and the remaining 45 molecules are judged to be non-musks. There are 166 features available that describe the molecules based on the shape of the molecule.
The Musk2 dataset is a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. Each instance corresponds to a possible configuration of a molecule. The 166 features that describe these molecules depend upon the exact shape, or conformation, of the molecule.
Context This large dataset with users interactions logs (page views) from a news portal was kindly provided by Globo.com, the most popular news portal in Brazil, for reproducibility of the experiments with CHAMELEON - a meta-architecture for contextual hybrid session-based news recommender systems. The source code was made available at GitHub.
The Papers with Code Leaderboards dataset is a collection of over 5,000 results capturing performance of machine learning models. Each result is a tuple of form (task, dataset, metric name, metric value). The data was collected using the Papers with Code review interface.
Dataset Card for The Cancer Genome Atlas (TCGA) Multimodal Dataset <!-- Provide a quick summary of the dataset. -->
A dataset consisting of recipient 46 users and, 26180 tweets. The dataset includes the news feed of the users and 13 features that may influence the relevance of the tweets.
Transaction fee mechanism (TFM) is an essential component of a blockchain protocol. However, a systematic evaluation of the real-world impact of TFMs is still absent. Using rich data from the Ethereum blockchain, mempool, and exchanges, we study the effect of EIP-1559, one of the first deployed TFMs that depart from the traditional first-price auction paradigm. We conduct a rigorous and comprehensive empirical study to examine its causal effect on blockchain transaction fee dynamics, transaction waiting time and security. Our results show that EIP-1559 improves the user experience by making fee estimation easier, mitigating intra-block difference of gas price paid, and reducing users' waiting times. However, EIP-1559 has only a small effect on gas fee levels and consensus security. In addition, we found that when Ether's price is more volatile, the waiting time is significantly higher. We also verify that a larger block size increases the presence of siblings. These findings suggest ne
The softwarised network data zoo (SNDZoo) is an open collection of software networking data sets aiming to streamline and ease machine learning research in the software networking domain. Most of the published data sets focus on, but are not limited to, the performance of virtualised network functions (VNFs). The data is collected using fully automated NFV benchmarking frameworks, such as tng-bench, developed by us or third party solutions like Gym. The collection of the presented data sets follows the general VNF benchmarking methodology described in.
The resources for this dataset can be found at https://www.openml.org/d/182
This resource is designed to allow for research into Natural Language Generation. In particular, with neural data-to-text approaches although it is not limited to these.
The dataset represents data generated from a commonly used model in population genetics. It comprises a matrix of 1,000,000 rows and 9 columns, representing parameters and summaries generated by an infinite-sites coalescent model for genetic variation. The first two columns encode the scaled mutation rate (theta) and scaled recombination rate (rho). The subsequent seven columns are data summaries: number of segregating sites (C1), standard uniform random noise acting as a distractor (C2), pairwise mean number of nucleotidic differences (C3), mean $R^2$ across pairs separated by <10% of the simulated genomic regions (C4), number of distinct haplotypes (C5), frequency of the most common haplotype (C6), number of singleton haplotypes (C7).
The Traffic Accident Prediction (TAP) data repository offers extensive coverage for 1,000 US cities (TAP-city) and 49 states (TAP-state), providing real-world road structure data that can be easily used for graph-based machine learning methods such as Graph Neural Networks. Additionally, it features multi-dimensional geospatial attributes, including angular and directional features, that are useful for analyzing transportation networks. The TAP repository has the potential to benefit the research community in various applications, including traffic crash prediction, road safety analysis, and traffic crash mitigation. The datasets can be accessed in the TAP-city and TAP-state directories.
We present TNCR, a new table dataset with varying image quality collected from free open source websites. TNCR dataset can be used for table detection in scanned document images and their classification into 5 different classes.
The code to create the dataset is available here. The dataset used in the paper is available on github
2 PAPERS • 2 BENCHMARKS
WDC SOTAB is a benchmark that features two annotation tasks: Column Type Annotation and Columns Property Annotation. The goal of the Column Type Annotation (CTA) task is to annotate the columns of a table with 91 Schema.org types, such as telephone, duration, Place, or Organization. The goal of the Columns Property Annotation (CPA) task is to annotate pairs of table columns with one out of 176 Schema.org properties, such as gtin13, startDate, priceValidUntil, or recipeIngredient. The benchmark consists of 59,548 tables annotated for CTA and 48,379 tables annotated for CPA originating from 74,215 different websites. The tables are split into training-, validation- and test sets for both tasks. The tables cover 17 popular Schema.org types including Product, LocalBusiness, Event, and JobPosting. The tables originate from the Schema.org Table Corpus.
Wyze Rule Recommendation Dataset. It is a big dataset with 300,000 users. Please cite [1] if you used the dataset and cite [2] if you referenced the algorithm.
This data set comes from breast cancer tissue samples deposited to The Cancer Genome Atlas (TCGA) project. TCGA contains data on tumour samples were assayed on several platforms; this data set compiles results obtained using Agilent mRNA expression microarrays.
From the official description:
Kickstarter is a community of more than 10 million people comprising of creative, tech enthusiasts who help in bringing creative project to life. Till now, more than $3 billion dollars have been contributed by the members in fueling creative projects. The projects can be literally anything – a device, a game, an app, a film etc.
Two datasets featuring binary and multi-class classification. The datasets were first introduced by K. Lang [1]. They can, for instance, be accessed at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
These are larger MATLAB .mat files required for reproducing plots from the sgbaird-5DOF/interp repository for grain boundary property interpolation. gitID-0055bee_uuID-475a2dfd_paper-data6.mat contains multiple trials of five degree-of-freedom interpolation model runs for various interpolation schemes. gpr46883_gitID-b473165_puuID-50ffdcf6_kim-rng11.mat contains a Gaussian Process Regression model trained on 46883 Fe simulation GBs. See Five degree-of-freedom property interpolation of arbitrary grain boundaries via Voronoi fundamental zone framework DOI: 10.1016/j.commatsci.2021.110756 for the peer-reviewed, published version of the paper.
1 PAPER • NO BENCHMARKS YET
The collected dataset consists of multivariate time series (MTS) data belonging to several ATMs banking along with the annotations that the operators did when they performed a maintenance task on any of the machines.
Multimodal object recognition is still an emerging field. Thus, publicly available datasets are still rare and of small size. This dataset was developed to help fill this void and presents multimodal data for 63 objects with some visual and haptic ambiguity. The dataset contains visual, kinesthetic and tactile (audio/vibrations) data. To completely solve sensory ambiguity, sensory integration/fusion would be required. This report describes the creation and structure of the dataset. The first section explains the underlying approach used to capture the visual and haptic properties of the objects. The second section describes the technical aspects (experimental setup) needed for the collection of the data. The third section introduces the objects, while the final section describes the structure and content of the dataset.
This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.
Data collected from two budget surveys (FY2021 in 2020 and FY2022 in 2021) in collaboration with the City of Austin budget department. Data contains preferences for each respondent and the day of their participation.
We collaborate with Blue Hexagon to release a dataset containing timestamped malware samples and well-curated family information for research purposes. The BODMAS dataset contains 57,293 malware samples and 77,142 benign samples collected from August 2019 to September 2020, with carefully curated family information (581 families). We also provide preprocessed feature vectors and metadata available to everyone. The malware binaries can be obtained per request.
The dataset contains a total of 253,070 records, with 18 features. The features are categorized into four different types: Metadata, Primary Data, Engagement Stats, and Label. Under the Metadata category contains basic information about the channel and video, such as their unique identifiers, date and time of publication, and thumbnail URLs. The Primary Data category contains information about the title and description of the video. The "Processed" columns refer to the cleaned data after denoising, deduplication and debiased for further analysis. The Engagement Stats category contains data on user engagement metrics for each video. The Label category contains predefined auto labels, human annotated labels, and AI generated pseudo labels. Auto labels are labels that are automatically derived based on a review of their titles, descriptions, and thumbnails over time. Channels with consistently misleading, exaggerated, or sensationalized content were labeled as clickbait. Those focusing on
The Berlin V2X dataset offers high-resolution GPS-located wireless measurements across diverse urban environments in the city of Berlin for both cellular and sidelink radio access technologies, acquired with up to 4 cars over 3 days. The data enables thus a variety of different ML studies towards vehicle-to-anything (V2X) communication.
Hand-disambiguation of a sample of U.S. patents inventor mentions from PatentsView.org.
Several datasets are fostering innovation in higher-level functions for everyone, everywhere. By providing this repository, we hope to encourage the research community to focus on hard problems. In this repository, we present the real results severity (BIRADS) and pathology (post-report) classifications provided by the Radiologist Director from the Radiology Department of Hospital Fernando Fonseca while diagnosing several patients (see dataset-uta4-dicom) from our User Tests and Analysis 4 (UTA4) study. Here, we provide a dataset for the measurements of both severity (BIRADS) and pathology classifications concerning the patient diagnostic. Work and results are published on a top Human-Computer Interaction (HCI) conference named AVI 2020 (page). Results were analyzed and interpreted from our Statistical Analysis charts. The user tests were made in clinical institutions, where clinicians diagnose several patients for a Single-Modality vs Multi-Modality comparison. For example, in these t
Several datasets are fostering innovation in higher-level functions for everyone, everywhere. By providing this repository, we hope to encourage the research community to focus on hard problems. In this repository, we present our severity rates (BIRADS) of clinicians while diagnosing several patients from our User Tests and Analysis 4 (UTA4) study. Here, we provide a dataset for the measurements of severity rates (BIRADS) concerning the patient diagnostic. Work and results are published on a top Human-Computer Interaction (HCI) conference named AVI 2020 (page). Results were analyzed and interpreted from our Statistical Analysis charts. The user tests were made in clinical institutions, where clinicians diagnose several patients for a Single-Modality vs Multi-Modality comparison. For example, in these tests, we used both prototype-single-modality and prototype-multi-modality repositories for the comparison. On the same hand, the hereby dataset represents the pieces of information of bot
The CANDOR corpus is a large, novel, multimodal corpus of 1,656 recorded conversations in spoken English. This 7+ million word, 850 hour corpus totals over 1TB of audio, video, and transcripts, with moment-to-moment measures of vocal, facial, and semantic expression, along with an extensive survey of speaker post conversation reflections.
This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA. The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).
1 PAPER • 1 BENCHMARK
Problem Statement
"The Chicago Face Database was developed at the University of Chicago by Debbie S. Ma, Joshua Correll, and Bernd Wittenbrink. The CFD is intended for use in scientific research. It provides high-resolution, standardized photographs of male and female faces of varying ethnicity between the ages of 17-65. Extensive norming data are available for each individual model. These data include both physical attributes (e.g., face size) as well as subjective ratings by independent judges (e.g., attractiveness).
This dataset contains a two-column CSV file, where the first column ("Valid_citing_DOI") contains the DOI of a citing entity retrieved in Crossref, while the second column ("Invalid_cited_DOI") contains the invalid DOI of a cited entity identified by looking at the field "reference" in the JSON document returned by querying the Crossref API with the citing DOI.
Co/FeMn bilayers measured.
The dataset contains 30 million cryptocurrency-related tweets from 10.10.2020 to 3.3.2021. See https://github.com/meakbiyik/ask-who-not-what for more details.