244 dataset results for Graphs

STRING is a collection of protein-protein interaction (PPI) networks.

34 PAPERS • NO BENCHMARKS YET

This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.

33 PAPERS • 6 BENCHMARKS

BioGRID

BioGRID (Biological General Repository for Interaction Datasets)

BioGRID is a biomedical interaction repository with data compiled through comprehensive curation efforts. The current index is version 4.2.192 and searches 75,868 publications for 1,997,840 protein and genetic interactions, 29,093 chemical interactions and 959,750 post-translational modifications from major model organism species.

33 PAPERS • 2 BENCHMARKS

Ciao

The Ciao dataset contains the ratings that users gave to items, along with item category information. The data comes from the Epinions dataset.

32 PAPERS • 1 BENCHMARK

Worldtree

Worldtree is a corpus of explanation graphs, explanatory role ratings, and an associated tablestore. It provides explanation graphs for 1,680 questions and 4,950 tablestore rows across 62 semi-structured tables. The data is intended to be paired with the AI2 Mercury Licensed questions.

32 PAPERS • NO BENCHMARKS YET

Decagon (Bio-decagon)

Bio-decagon is a dataset for the polypharmacy side effect identification problem, framed as a multirelational link prediction problem on a two-layer multimodal graph with two node types: drugs and proteins. The protein-protein interaction network describes relationships between proteins. The drug-drug interaction network contains 964 different types of edges (one for each side effect type) and describes which drug pairs lead to which side effects. Lastly, drug-protein links describe the proteins targeted by a given drug.

31 PAPERS • 1 BENCHMARK
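The two-layer, multirelational structure described for Bio-decagon can be pictured with a small sketch. The drug and protein identifiers, the side effect names, and the helpers `add_side_effect` and `side_effects_for` are hypothetical placeholders chosen for illustration; they are not taken from the dataset files or from any published loader.

```python
from collections import defaultdict

# Toy multirelational graph; every identifier below is a made-up placeholder.
ppi_edges = {("P1", "P2"), ("P2", "P3")}            # protein-protein interactions
drug_targets = {"D1": {"P1"}, "D2": {"P2", "P3"}}   # drug -> proteins it targets

# One edge type per side effect: side effect -> set of unordered drug pairs.
ddi_edges = defaultdict(set)


def add_side_effect(drug_a, drug_b, side_effect):
    """Record that taking drug_a and drug_b together is linked to side_effect."""
    ddi_edges[side_effect].add(frozenset((drug_a, drug_b)))


def side_effects_for(drug_a, drug_b):
    """Return the sorted side effect types recorded for an unordered drug pair."""
    pair = frozenset((drug_a, drug_b))
    return sorted(se for se, pairs in ddi_edges.items() if pair in pairs)


add_side_effect("D1", "D2", "nausea")
add_side_effect("D2", "D1", "headache")
add_side_effect("D1", "D3", "nausea")

print(side_effects_for("D1", "D2"))
```

Link prediction in this setting amounts to asking, for a given drug pair and side effect type, whether that typed edge should exist.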

Email-EU

EmailEU is a directed temporal network constructed from email exchanges at a large European research institution over an 803-day period. It contains 986 email addresses as nodes and 332,334 timestamped emails as edges. The dataset includes 42 ground-truth departments.

31 PAPERS • NO BENCHMARKS YET
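A directed temporal network such as Email-EU can be held as a list of timestamped edges. The sketch below uses a toy edge list; the `Email` record, the timestamp unit, and the helper names are assumptions made for illustration and do not reflect the dataset's actual file format.

```python
from typing import NamedTuple


class Email(NamedTuple):
    sender: int      # node id of the sending address
    recipient: int   # node id of the receiving address
    timestamp: int   # assumed unit: seconds since the first email


def edges_in_window(emails, start, end):
    """Return the emails with start <= timestamp < end, ordered by time."""
    selected = (e for e in emails if start <= e.timestamp < end)
    return sorted(selected, key=lambda e: e.timestamp)


def out_degree(emails):
    """Count how many emails each address sent."""
    counts = {}
    for e in emails:
        counts[e.sender] = counts.get(e.sender, 0) + 1
    return counts


# Toy data, not drawn from the dataset.
toy = [Email(0, 1, 10), Email(1, 0, 25), Email(0, 2, 40), Email(2, 1, 90)]

print(edges_in_window(toy, 20, 50))
print(out_degree(toy))
```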

OGB-LSC (OGB Large-Scale Challenge)

OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification.

31 PAPERS • 3 BENCHMARKS

The Pinterest dataset contains more than 1 million images, each associated with the Pinterest users who have “pinned” it.

31 PAPERS • 1 BENCHMARK

CSL

CSL is a synthetic dataset introduced in Murphy et al. (2019) to test the expressivity of GNNs. All graphs in the dataset are regular with the same node degree, so degree alone cannot tell them apart; the task is to classify each graph into its isomorphism class.

29 PAPERS • 2 BENCHMARKS

LDC2017T10

LDC2017T10 (Abstract Meaning Representation (AMR) Annotation Release 2.0)

Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.

27 PAPERS • 2 BENCHMARKS

ChEBI-20

The dataset contains 33,010 molecule-description pairs, divided into 80%/10%/10% train/validation/test splits. The goal of the task is to retrieve the relevant molecule for a natural language description.

24 PAPERS • 4 BENCHMARKS

REDDIT-12K

Reddit12k contains 11,929 graphs, each corresponding to an online discussion thread. Nodes represent users, and an edge indicates that one of the two users responded to a comment by the other. Each of the 11,929 discussion graphs carries one of 11 graph labels, representing the category of the community.

24 PAPERS • NO BENCHMARKS YET
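The graph construction described for REDDIT-12K (users as nodes, an edge when one user replied to the other) can be sketched as follows. The input format, the user names, and the function `thread_to_graph` are hypothetical; the sketch also assumes undirected edges and drops self-replies, which the description above does not specify.

```python
def thread_to_graph(replies):
    """Build an undirected user graph from (replier, replied_to) pairs.

    Returns (nodes, edges), where each edge is a frozenset of two users.
    Self-replies add the user as a node but create no edge (an assumption).
    """
    nodes = set()
    edges = set()
    for replier, replied_to in replies:
        nodes.update((replier, replied_to))
        if replier != replied_to:
            edges.add(frozenset((replier, replied_to)))
    return nodes, edges


# Toy thread, not drawn from the dataset.
replies = [
    ("alice", "bob"),
    ("bob", "alice"),    # same pair in the other direction: still one edge
    ("carol", "alice"),
    ("carol", "carol"),  # self-reply: ignored as an edge
]

nodes, edges = thread_to_graph(replies)
print(len(nodes), len(edges))
```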

twitch-gamers

Node classification on twitch-gamers.

23 PAPERS • 2 BENCHMARKS

questions

Questions is an interaction graph of users of a question-answering website based on data provided by Yandex Q.

22 PAPERS • 1 BENCHMARK

roman-empire

Roman-empire is a word dependency graph based on the Roman Empire article from the English Wikipedia.

22 PAPERS • 1 BENCHMARK

AGENDA

AGENDA (Abstract GENeration DAtaset)

Abstract GENeration DAtaset (AGENDA) is a dataset of knowledge graphs paired with scientific abstracts. The dataset consists of 40k paper titles and abstracts from the Semantic Scholar Corpus taken from the proceedings of 12 top AI conferences.

19 PAPERS • 1 BENCHMARK

LastFM Asia

A social network of LastFM users which was collected from the public API in March 2020. Nodes are LastFM users from Asian countries and edges are mutual follower relationships between them. The vertex features are extracted based on the artists liked by the users. The task related to the graph is multinomial node classification - one has to predict the location of users. This target feature was derived from the country field for each user.

19 PAPERS • NO BENCHMARKS YET

Chameleon (48%/32%/20% fixed splits)

Node classification on Chameleon with the fixed 48%/32%/20% splits provided by Geom-GCN.

18 PAPERS • 2 BENCHMARKS

Deezer-Europe

Node classification on Deezer Europe with 50%/25%/25% random splits for training/validation/test.

18 PAPERS • 1 BENCHMARK

UMLS

UMLS (Unified Medical Language System)

The Unified Medical Language System (UMLS) is a comprehensive resource that integrates and disseminates essential terminology, classification standards, and coding systems. Its purpose is to foster the creation of more effective and interoperable biomedical information systems and services, including electronic health records.

18 PAPERS • 1 BENCHMARK

Yeast

The Yeast dataset consists of a protein-protein interaction network. Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology.

18 PAPERS • NO BENCHMARKS YET

Film (60%/20%/20% random splits)

Node classification on Film with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK
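Several entries in this listing, including Film above, use 60%/20%/20% random splits for training/validation/test. A minimal sketch of such a split over node indices is shown below; the function name `random_split`, its parameters, and the seed handling are illustrative assumptions, not the procedure used by any particular benchmark.

```python
import random


def random_split(num_nodes, percents=(60, 20, 20), seed=0):
    """Shuffle node indices and cut them into train/val/test index lists.

    percents are integer percentages; any remainder from rounding down
    goes to the test set.
    """
    indices = list(range(num_nodes))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = num_nodes * percents[0] // 100
    n_val = num_nodes * percents[1] // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = random_split(100)
print(len(train), len(val), len(test))
```

Fixing the seed keeps a given split reproducible, while varying it yields the multiple random splits over which results are often averaged.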

PubMed (60%/20%/20% random splits)

Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK

Squirrel (48%/32%/20% fixed splits)

Node classification on Squirrel with the fixed 48%/32%/20% splits provided by Geom-GCN.

17 PAPERS • 2 BENCHMARKS

Squirrel (60%/20%/20% random splits)

Node classification on Squirrel with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK

Wisconsin (60%/20%/20% random splits)

Node classification on Wisconsin with 60%/20%/20% random splits for training/validation/test.

17 PAPERS • 1 BENCHMARK

tolokers

Tolokers is a network of workers on a crowdsourcing platform, based on data provided by Toloka.

17 PAPERS • 1 BENCHMARK

Cornell (48%/32%/20% fixed splits)

Node classification on Cornell with the fixed 48%/32%/20% splits provided by Geom-GCN.

16 PAPERS • 2 BENCHMARKS

Cornell (60%/20%/20% random splits)

Node classification on Cornell with 60%/20%/20% random splits for training/validation/test.

16 PAPERS • 2 BENCHMARKS

Texas (60%/20%/20% random splits)

Node classification on Texas with 60%/20%/20% random splits for training/validation/test.

16 PAPERS • 1 BENCHMARK

minesweeper

Minesweeper is a synthetic graph emulating the eponymous game.

16 PAPERS • 1 BENCHMARK

Argoverse 2 Motion Forecasting

The Argoverse 2 Motion Forecasting Dataset is a curated collection of 250,000 scenarios for training and validation. Each scenario is 11 seconds long and contains the 2D, bird's-eye-view centroid and heading of each tracked object sampled at 10 Hz.

15 PAPERS • NO BENCHMARKS YET
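The Argoverse 2 figures above imply at most 11 × 10 = 110 samples per tracked object in a scenario. The sketch below works through that arithmetic; the `TrackState` record and its field names are hypothetical and do not mirror the dataset's actual schema, and a real track may cover only part of a scenario.

```python
from dataclasses import dataclass

SCENARIO_SECONDS = 11   # scenario length stated in the description
SAMPLE_RATE_HZ = 10     # sampling rate stated in the description


@dataclass
class TrackState:
    """One sample of a tracked object (hypothetical field names)."""
    t: float        # time in seconds from scenario start
    x: float        # bird's-eye-view centroid, x coordinate
    y: float        # bird's-eye-view centroid, y coordinate
    heading: float  # heading angle in radians (assumed unit)


steps_per_scenario = SCENARIO_SECONDS * SAMPLE_RATE_HZ
timestamps = [i / SAMPLE_RATE_HZ for i in range(steps_per_scenario)]

# A toy, fully observed track that moves along the x axis.
track = [TrackState(t=t, x=2.0 * t, y=0.0, heading=0.0) for t in timestamps]

print(steps_per_scenario, timestamps[0], timestamps[-1])
```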

Chameleon (60%/20%/20% random splits)

Node classification on Chameleon with 60%/20%/20% random splits for training/validation/test.

15 PAPERS • 1 BENCHMARK

Citeseer (48%/32%/20% fixed splits)

Node classification on Citeseer with the fixed 48%/32%/20% splits provided by Geom-GCN.

15 PAPERS • 1 BENCHMARK

Cora (48%/32%/20% fixed splits)

Node classification on Cora with the fixed 48%/32%/20% splits provided by Geom-GCN.

15 PAPERS • 1 BENCHMARK

PubMed (48%/32%/20% fixed splits)

Node classification on PubMed with the fixed 48%/32%/20% splits provided by Geom-GCN.

15 PAPERS • 1 BENCHMARK

Wiki-One

This dataset is a Wikipedia dump, split by relations to perform Few-Shot Knowledge Graph Completion.

15 PAPERS • NO BENCHMARKS YET

Wisconsin (48%/32%/20% fixed splits)

Node classification on Wisconsin with the fixed 48%/32%/20% splits provided by Geom-GCN.

15 PAPERS • 2 BENCHMARKS

BeerAdvocate

BeerAdvocate is a dataset of beer reviews from the BeerAdvocate website. The data span a period of more than 10 years, including all ~1.5 million reviews up to November 2011. Each review includes ratings in terms of five "aspects": appearance, aroma, palate, taste, and overall impression. Reviews include product and user information, followed by each of these five ratings, and a plaintext review.

14 PAPERS • 1 BENCHMARK

Film (48%/32%/20% fixed splits)

Node classification on Film with the fixed 48%/32%/20% splits provided by Geom-GCN.

14 PAPERS • 2 BENCHMARKS

Texas (48%/32%/20% fixed splits)

Node classification on Texas with the fixed 48%/32%/20% splits provided by Geom-GCN.

14 PAPERS • 2 BENCHMARKS

amazon-ratings

Amazon-ratings is a product co-purchasing network based on data from the SNAP datasets.

14 PAPERS • 1 BENCHMARK

Elliptic Dataset

13 PAPERS • 1 BENCHMARK

Linux

Linux (Linux Program Dependence Graphs)

The LINUX dataset consists of 48,747 Program Dependence Graphs (PDGs) generated from the Linux kernel. Each graph represents a function, where a node represents one statement and an edge represents the dependency between two statements.

13 PAPERS • NO BENCHMARKS YET

MalNet

MalNet is a large public graph database, representing a large-scale ontology of software function call graphs. MalNet contains over 1.2 million graphs, averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families.

13 PAPERS • 4 BENCHMARKS

BotNet

The BotNet dataset is a set of topological botnet detection datasets for graph neural networks.

12 PAPERS • NO BENCHMARKS YET

MovieGraphs

MovieGraphs provides detailed, graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes that capture who is present in the clip, their emotional and physical attributes, their relationships (e.g., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details and with reasons that give motivations for actions.

12 PAPERS • NO BENCHMARKS YET