178 dataset results for Tabular

GIRT-Data (GitHub Issue Report Template Dataset)

GIRT-Data is the first and largest dataset of issue report templates (IRTs) in both YAML and Markdown format. This dataset and its corresponding open-source crawler tool are intended to support research in this area and to encourage more developers to use IRTs in their repositories. The stable version of the dataset contains 1,084,300 repositories, of which 50,032 support IRTs.

2 PAPERS • NO BENCHMARKS YET

Hotel Sales (Time Series)

The dataset contains the hotel demand and revenue of 8 major tourist destinations in the US (e.g., Los Angeles, Orlando ...). The dataset contains sales, daily occupancy, demand, and revenue of the upper-middle class hotels.

2 PAPERS • NO BENCHMARKS YET

HumSet

Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data, a process that can highly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as assigned classes to each entry annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of expe

2 PAPERS • NO BENCHMARKS YET

Hurricane (Time Series Hurricane)

Hurricane is a new spatio-temporal benchmark dataset suited for forecasting during extreme events and anomalies. The data come from the Florida Department of Revenue, which provides the monthly sales revenue (2003-2020) of the tourism industry for all 67 counties of Florida, which are prone to annual hurricanes. The raw time series are aligned and joined by time with the history of hurricane categories for each county. More precisely, the hurricane category indicates the maximum sustained wind speed, which can result in catastrophic damage (Oceanic 2022).

2 PAPERS • 1 BENCHMARK

IHDS

IHDS (Indian Human Development Survey)

IHDS is a nationally representative, multi-topic panel survey of 41,554 households in 1503 villages and 971 urban neighborhoods across India.

2 PAPERS • NO BENCHMARKS YET

Information Extraction from Tables

Information Extraction from Tables (Extraction materials compositions from tables of materials science research papers)

2 PAPERS • NO BENCHMARKS YET

Large-scale Ridesharing DARP Instances Based on Real Travel Demand

This dataset presents a set of large-scale ridesharing Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.

2 PAPERS • NO BENCHMARKS YET

Multivariate-Mobility-Paris

The original dataset, provided by Orange telecom in France, contains anonymized and aggregated human mobility data. The Multivariate-Mobility-Paris dataset covers 2020-08-24 to 2020-11-04 (72 days during the COVID-19 pandemic), with a time granularity of 30 minutes and a spatial granularity of 6 coarse regions in Paris, France. In other words, it is a multivariate time series dataset.

2 PAPERS • NO BENCHMARKS YET

Musk v1

The Musk dataset describes a set of molecules, and the objective is to distinguish musks from non-musks. It covers 92 molecules, of which 47 are judged by human experts to be musks and the remaining 45 are judged to be non-musks. There are 166 features that describe the molecules based on their shape.

2 PAPERS • 1 BENCHMARK

Musk v2

The Musk2 dataset is a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. Each instance corresponds to a possible configuration of a molecule. The 166 features that describe these molecules depend upon the exact shape, or conformation, of the molecule.

2 PAPERS • NO BENCHMARKS YET

News Interactions on Globo.com

News Interactions on Globo.com (News Portal User Interactions by Globo.com - A large dataset for news recommendations offline evaluation and analytics)

This large dataset of user interaction logs (page views) from a news portal was kindly provided by Globo.com, the most popular news portal in Brazil, for reproducibility of the experiments with CHAMELEON, a meta-architecture for contextual hybrid session-based news recommender systems. The source code was made available on GitHub.

2 PAPERS • NO BENCHMARKS YET

PWC Leaderboards (Papers with Code Leaderboards)

The Papers with Code Leaderboards dataset is a collection of over 5,000 results capturing performance of machine learning models. Each result is a tuple of form (task, dataset, metric name, metric value). The data was collected using the Papers with Code review interface.

2 PAPERS • 1 BENCHMARK

PanCancer Multimodal (HoneyBee)

Dataset Card for The Cancer Genome Atlas (TCGA) Multimodal Dataset

2 PAPERS • NO BENCHMARKS YET

Ranking social media news feed

A dataset consisting of 46 recipient users and 26,180 tweets. The dataset includes the news feed of the users and 13 features that may influence the relevance of the tweets.

2 PAPERS • NO BENCHMARKS YET

Replication Data for: "Empirical Analysis of EIP-1559: Transaction Fees, Waiting Time, and Consensus Security"

Transaction fee mechanism (TFM) is an essential component of a blockchain protocol. However, a systematic evaluation of the real-world impact of TFMs is still absent. Using rich data from the Ethereum blockchain, mempool, and exchanges, we study the effect of EIP-1559, one of the first deployed TFMs that depart from the traditional first-price auction paradigm. We conduct a rigorous and comprehensive empirical study to examine its causal effect on blockchain transaction fee dynamics, transaction waiting time and security. Our results show that EIP-1559 improves the user experience by making fee estimation easier, mitigating intra-block difference of gas price paid, and reducing users' waiting times. However, EIP-1559 has only a small effect on gas fee levels and consensus security. In addition, we found that when Ether's price is more volatile, the waiting time is significantly higher. We also verify that a larger block size increases the presence of siblings. These findings suggest ne

2 PAPERS • NO BENCHMARKS YET

SNDZoo (The Softwarised Network Data Zoo)

The softwarised network data zoo (SNDZoo) is an open collection of software networking data sets aiming to streamline and ease machine learning research in the software networking domain. Most of the published data sets focus on, but are not limited to, the performance of virtualised network functions (VNFs). The data is collected using fully automated NFV benchmarking frameworks, such as tng-bench, developed by us or third party solutions like Gym. The collection of the presented data sets follows the general VNF benchmarking methodology described in.

2 PAPERS • NO BENCHMARKS YET

Satimage

The resources for this dataset can be found at https://www.openml.org/d/182

2 PAPERS • NO BENCHMARKS YET

SportSett

This resource is designed to allow for research into Natural Language Generation. It targets neural data-to-text approaches in particular, although it is not limited to these.

2 PAPERS • NO BENCHMARKS YET

Summaries of genetic variation

The dataset represents data generated from a commonly used model in population genetics. It comprises a matrix of 1,000,000 rows and 9 columns, representing parameters and summaries generated by an infinite-sites coalescent model for genetic variation. The first two columns encode the scaled mutation rate (theta) and scaled recombination rate (rho). The subsequent seven columns are data summaries: number of segregating sites (C1), standard uniform random noise acting as a distractor (C2), pairwise mean number of nucleotidic differences (C3), mean $R^2$ across pairs separated by <10% of the simulated genomic regions (C4), number of distinct haplotypes (C5), frequency of the most common haplotype (C6), number of singleton haplotypes (C7).
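The column layout described above can be sketched as a small helper that separates the two model parameters from the seven summaries. This is only an illustration: the identifier names and the sample row are made up, and the snippet does not load the actual file.

```python
# Column order as given in the description: two parameters, then seven summaries.
# The Python identifiers are illustrative, not names taken from the dataset files.
COLUMNS = ["theta", "rho", "C1", "C2", "C3", "C4", "C5", "C6", "C7"]


def split_row(row):
    """Split one 9-value row into (parameters, summaries) dictionaries."""
    if len(row) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} values, got {len(row)}")
    named = dict(zip(COLUMNS, row))
    params = {name: named[name] for name in COLUMNS[:2]}
    summaries = {name: named[name] for name in COLUMNS[2:]}
    return params, summaries


# Made-up values standing in for one of the 1,000,000 rows.
demo_row = [1.5, 0.3, 12, 0.42, 3.1, 0.08, 9, 0.25, 4]
params, summaries = split_row(demo_row)
```

Note that C2 is described as pure noise, so a method evaluated on this data would ideally learn to ignore it.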

2 PAPERS • NO BENCHMARKS YET

TAP

TAP (Traffic Accident Prediction data repository)

The Traffic Accident Prediction (TAP) data repository offers extensive coverage for 1,000 US cities (TAP-city) and 49 states (TAP-state), providing real-world road structure data that can be easily used for graph-based machine learning methods such as Graph Neural Networks. Additionally, it features multi-dimensional geospatial attributes, including angular and directional features, that are useful for analyzing transportation networks. The TAP repository has the potential to benefit the research community in various applications, including traffic crash prediction, road safety analysis, and traffic crash mitigation. The datasets can be accessed in the TAP-city and TAP-state directories.

2 PAPERS • NO BENCHMARKS YET

TNCR Dataset (Table Net Detection and Classification Dataset)

We present TNCR, a new table dataset with varying image quality collected from free open source websites. The TNCR dataset can be used for table detection in scanned document images and their classification into 5 different classes.

2 PAPERS • NO BENCHMARKS YET

Vehicle Claims

The code to create the dataset is available here. The dataset used in the paper is available on github

2 PAPERS • 2 BENCHMARKS

WDC SOTAB

WDC SOTAB is a benchmark that features two annotation tasks: Column Type Annotation and Columns Property Annotation. The goal of the Column Type Annotation (CTA) task is to annotate the columns of a table with 91 Schema.org types, such as telephone, duration, Place, or Organization. The goal of the Columns Property Annotation (CPA) task is to annotate pairs of table columns with one out of 176 Schema.org properties, such as gtin13, startDate, priceValidUntil, or recipeIngredient. The benchmark consists of 59,548 tables annotated for CTA and 48,379 tables annotated for CPA originating from 74,215 different websites. The tables are split into training-, validation- and test sets for both tasks. The tables cover 17 popular Schema.org types including Product, LocalBusiness, Event, and JobPosting. The tables originate from the Schema.org Table Corpus.

2 PAPERS • 2 BENCHMARKS

WyzeRule

Wyze Rule Recommendation Dataset. It is a big dataset with 300,000 users. Please cite [1] if you used the dataset and cite [2] if you referenced the algorithm.

2 PAPERS • NO BENCHMARKS YET

bcTCGA

bcTCGA (The Cancer Genome Atlas Program)

This data set comes from breast cancer tissue samples deposited to The Cancer Genome Atlas (TCGA) project. TCGA contains data on tumour samples that were assayed on several platforms; this data set compiles results obtained using Agilent mRNA expression microarrays.

2 PAPERS • NO BENCHMARKS YET

e2006

e2006 (10-K Corpus)

From the official description:

2 PAPERS • NO BENCHMARKS YET

kickstarter

kickstarter (Funding Successful Projects on Kickstarter)

Kickstarter is a community of more than 10 million people, made up of creative and tech enthusiasts who help bring creative projects to life. To date, members have contributed more than $3 billion to fund creative projects. The projects can be literally anything: a device, a game, an app, a film, etc.

2 PAPERS • 1 BENCHMARK

news20

news20 (NewsWeeder: learning to filter netnews)

Two datasets featuring binary and multi-class classification. The datasets were first introduced by K. Lang [1]. They can, for instance, be accessed at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

2 PAPERS • NO BENCHMARKS YET

5DOF GB Interpolation (Five Degree-of-Freedom Grain Boundary Interpolation)

These are larger MATLAB .mat files required for reproducing plots from the sgbaird-5DOF/interp repository for grain boundary property interpolation. gitID-0055bee_uuID-475a2dfd_paper-data6.mat contains multiple trials of five degree-of-freedom interpolation model runs for various interpolation schemes. gpr46883_gitID-b473165_puuID-50ffdcf6_kim-rng11.mat contains a Gaussian Process Regression model trained on 46883 Fe simulation GBs. See Five degree-of-freedom property interpolation of arbitrary grain boundaries via Voronoi fundamental zone framework DOI: 10.1016/j.commatsci.2021.110756 for the peer-reviewed, published version of the paper.

1 PAPER • NO BENCHMARKS YET

A comparison of different maturity models

1 PAPER • NO BENCHMARKS YET

ATMs fault prediction

The collected dataset consists of multivariate time series (MTS) data from several bank ATMs, along with the annotations that operators made when they performed a maintenance task on any of the machines.

1 PAPER • NO BENCHMARKS YET

AU Dataset for Visuo-Haptic Object Recognition for Robots

Multimodal object recognition is still an emerging field. Thus, publicly available datasets are still rare and of small size. This dataset was developed to help fill this void and presents multimodal data for 63 objects with some visual and haptic ambiguity. The dataset contains visual, kinesthetic and tactile (audio/vibrations) data. To completely solve sensory ambiguity, sensory integration/fusion would be required. This report describes the creation and structure of the dataset. The first section explains the underlying approach used to capture the visual and haptic properties of the objects. The second section describes the technical aspects (experimental setup) needed for the collection of the data. The third section introduces the objects, while the final section describes the structure and content of the dataset.

1 PAPER • NO BENCHMARKS YET

Adult Census Income

Adult Census Income (adult_census_income)

This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.
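The extraction condition quoted above translates directly into a record filter. The sketch below assumes each record is a dictionary keyed by the field names used in the condition (AAGE, AGI, AFNLWGT, HRSWK); the two sample records are invented for illustration.

```python
def is_clean_record(record):
    """Apply the condition (AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)."""
    return (
        record["AAGE"] > 16
        and record["AGI"] > 100
        and record["AFNLWGT"] > 1
        and record["HRSWK"] > 0
    )


# Invented sample records; the second fails the HRSWK > 0 test.
records = [
    {"AAGE": 39, "AGI": 40000, "AFNLWGT": 77516, "HRSWK": 40},
    {"AAGE": 52, "AGI": 25000, "AFNLWGT": 20911, "HRSWK": 0},
]
clean = [r for r in records if is_clean_record(r)]
```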

1 PAPER • NO BENCHMARKS YET

Austin Budget Survey Data FY2021 and FY2022

Data collected from two budget surveys (FY2021 in 2020 and FY2022 in 2021) in collaboration with the City of Austin budget department. Data contains preferences for each respondent and the day of their participation.

1 PAPER • NO BENCHMARKS YET

BODMAS

BODMAS (Blue Hexagon Open Dataset for Malware AnalysiS)

We collaborate with Blue Hexagon to release a dataset containing timestamped malware samples and well-curated family information for research purposes. The BODMAS dataset contains 57,293 malware samples and 77,142 benign samples collected from August 2019 to September 2020, with carefully curated family information (581 families). We also provide preprocessed feature vectors and metadata available to everyone. The malware binaries can be obtained per request.

1 PAPER • NO BENCHMARKS YET

BaitBuster-Bangla: A Comprehensive Dataset for Clickbait Detection in Bangla with Multi-Feature and Multi-Modal Analysis

The dataset contains a total of 253,070 records, with 18 features. The features are categorized into four different types: Metadata, Primary Data, Engagement Stats, and Label. The Metadata category contains basic information about the channel and video, such as their unique identifiers, date and time of publication, and thumbnail URLs. The Primary Data category contains information about the title and description of the video. The "Processed" columns refer to the cleaned data after denoising, deduplication and debiasing for further analysis. The Engagement Stats category contains data on user engagement metrics for each video. The Label category contains predefined auto labels, human annotated labels, and AI generated pseudo labels. Auto labels are labels that are automatically derived based on a review of their titles, descriptions, and thumbnails over time. Channels with consistently misleading, exaggerated, or sensationalized content were labeled as clickbait. Those focusing on

1 PAPER • NO BENCHMARKS YET

Berlin V2X

The Berlin V2X dataset offers high-resolution GPS-located wireless measurements across diverse urban environments in the city of Berlin for both cellular and sidelink radio access technologies, acquired with up to 4 cars over 3 days. The data thus enables a variety of ML studies on vehicle-to-anything (V2X) communication.

1 PAPER • NO BENCHMARKS YET

Binette's 2022 Inventors Benchmark

Hand-disambiguation of a sample of U.S. patent inventor mentions from PatentsView.org.

1 PAPER • NO BENCHMARKS YET

BreastClassifications4 ([MIMBCD-UI] UTA4: Severity & Pathology Classifications Dataset)

Several datasets are fostering innovation in higher-level functions for everyone, everywhere. By providing this repository, we hope to encourage the research community to focus on hard problems. In this repository, we present the actual severity (BIRADS) and pathology (post-report) classifications provided by the Radiologist Director from the Radiology Department of Hospital Fernando Fonseca while diagnosing several patients (see dataset-uta4-dicom) from our User Tests and Analysis 4 (UTA4) study. Here, we provide a dataset for the measurements of both severity (BIRADS) and pathology classifications for each patient diagnosis. Work and results are published at AVI 2020, a top Human-Computer Interaction (HCI) conference. Results were analyzed and interpreted from our Statistical Analysis charts. The user tests were conducted in clinical institutions, where clinicians diagnose several patients for a Single-Modality vs Multi-Modality comparison. For example, in these t

1 PAPER • NO BENCHMARKS YET

BreastRates4 ([MIMBCD-UI] UTA4: Rates Dataset)

Several datasets are fostering innovation in higher-level functions for everyone, everywhere. By providing this repository, we hope to encourage the research community to focus on hard problems. In this repository, we present the severity rates (BIRADS) given by clinicians while diagnosing several patients from our User Tests and Analysis 4 (UTA4) study. Here, we provide a dataset for the measurements of severity rates (BIRADS) for each patient diagnosis. Work and results are published at AVI 2020, a top Human-Computer Interaction (HCI) conference. Results were analyzed and interpreted from our Statistical Analysis charts. The user tests were conducted in clinical institutions, where clinicians diagnose several patients for a Single-Modality vs Multi-Modality comparison. For example, in these tests, we used both prototype-single-modality and prototype-multi-modality repositories for the comparison. Likewise, this dataset represents the pieces of information of bot

1 PAPER • NO BENCHMARKS YET

CANDOR Corpus (CANDOR = Conversation: A Naturalistic Dataset of Online Recordings)

The CANDOR corpus is a large, novel, multimodal corpus of 1,656 recorded conversations in spoken English. This 7+ million word, 850 hour corpus totals over 1TB of audio, video, and transcripts, with moment-to-moment measures of vocal, facial, and semantic expression, along with an extensive survey of speaker post conversation reflections.

1 PAPER • NO BENCHMARKS YET

CVR (Congressional Voting Records Data Set)

This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA. The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).
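The nine-to-three simplification described above amounts to a lookup table. The sketch below spells the nine vote types exactly as in the description; how they are encoded in the actual data files is not stated here, so the string keys are an assumption.

```python
# Nine CQA vote types mapped to the three simplified dispositions.
VOTE_TO_DISPOSITION = {
    "voted for": "yea",
    "paired for": "yea",
    "announced for": "yea",
    "voted against": "nay",
    "paired against": "nay",
    "announced against": "nay",
    "voted present": "unknown",
    "voted present to avoid conflict of interest": "unknown",
    "did not vote or otherwise make a position known": "unknown",
}


def simplify_vote(vote_type):
    """Return 'yea', 'nay', or 'unknown' for one of the nine vote types."""
    return VOTE_TO_DISPOSITION[vote_type.strip().lower()]
```

For example, `simplify_vote("Paired For")` returns `"yea"`, and an unrecognized vote type raises `KeyError` rather than being silently mapped.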

1 PAPER • 1 BENCHMARK

Can you predict product backorder?

Problem Statement

1 PAPER • NO BENCHMARKS YET

Chicago Face Database (CFD)

The Chicago Face Database was developed at the University of Chicago by Debbie S. Ma, Joshua Correll, and Bernd Wittenbrink. The CFD is intended for use in scientific research. It provides high-resolution, standardized photographs of male and female faces of varying ethnicity between the ages of 17-65. Extensive norming data are available for each individual model. These data include both physical attributes (e.g., face size) as well as subjective ratings by independent judges (e.g., attractiveness).

1 PAPER • NO BENCHMARKS YET

Citations to invalid DOI-identified entities obtained from processing DOI-to-DOI citations to add in COCI

This dataset contains a two-column CSV file, where the first column ("Valid_citing_DOI") contains the DOI of a citing entity retrieved in Crossref, while the second column ("Invalid_cited_DOI") contains the invalid DOI of a cited entity identified by looking at the field "reference" in the JSON document returned by querying the Crossref API with the citing DOI.
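A two-column file of this shape can be read with Python's standard `csv` module. In the sketch below the column names come from the description, while the in-memory sample and its DOIs are invented placeholders rather than rows from the dataset.

```python
import csv
import io

# Invented sample with the two documented column names; not real dataset rows.
SAMPLE_CSV = (
    '"Valid_citing_DOI","Invalid_cited_DOI"\n'
    '"10.1234/example.citing.1","10.5678/example-broken-a"\n'
    '"10.1234/example.citing.2","10.5678/example-broken-b"\n'
)


def read_citation_pairs(handle):
    """Return (citing DOI, invalid cited DOI) tuples from an open CSV handle."""
    reader = csv.DictReader(handle)
    return [(row["Valid_citing_DOI"], row["Invalid_cited_DOI"]) for row in reader]


pairs = read_citation_pairs(io.StringIO(SAMPLE_CSV))
```

With a real file, the same function could be given a handle from `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was downloaded.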

1 PAPER • NO BENCHMARKS YET

Co/FeMn bilayers

Co/FeMn bilayers measured.

1 PAPER • NO BENCHMARKS YET

Concerns and Value Judgments of Stakeholders in the Non-Fungible Tokens (NFTs) Market

Concerns and Value Judgments of Stakeholders in the Non-Fungible Tokens (NFTs) Market (Replication Data for: "Centralized or Decentralized?")

1 PAPER • NO BENCHMARKS YET

The dataset contains 30 million cryptocurrency-related tweets from 10.10.2020 to 3.3.2021. See https://github.com/meakbiyik/ask-who-not-what for more details.

1 PAPER • NO BENCHMARKS YET
