STRING is a collection of protein-protein interaction (PPI) networks.
34 PAPERS • NO BENCHMARKS YET
This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014.
33 PAPERS • 6 BENCHMARKS
BioGRID is a biomedical interaction repository with data compiled through comprehensive curation efforts. The current index is version 4.2.192 and searches 75,868 publications for 1,997,840 protein and genetic interactions, 29,093 chemical interactions and 959,750 post-translational modifications from major model organism species.
33 PAPERS • 2 BENCHMARKS
The Ciao dataset contains ratings that users have given to items, together with item category information. The data comes from the Epinions dataset.
32 PAPERS • 1 BENCHMARK
Worldtree is a corpus of explanation graphs, explanatory role ratings, and an associated tablestore. It provides explanation graphs for 1,680 questions and 4,950 tablestore rows across 62 semi-structured tables. This data is intended to be paired with the AI2 Mercury Licensed questions.
32 PAPERS • NO BENCHMARKS YET
Bio-decagon is a dataset for the polypharmacy side effect identification problem, framed as a multirelational link prediction problem on a two-layer multimodal graph with two node types: drugs and proteins. The protein-protein interaction network describes relationships between proteins. The drug-drug interaction network contains 964 different types of edges (one for each side effect type) and describes which drug pairs lead to which side effects. Lastly, drug-protein links describe the proteins targeted by a given drug.
31 PAPERS • 1 BENCHMARK
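The Bio-decagon structure described above (typed drug-drug edges, protein-protein edges, and drug-protein links) can be illustrated with a minimal sketch. All identifiers here (`P1`, `D1`, the side effect names, the `side_effect:` prefix, and the helper functions) are hypothetical and are not taken from the dataset or its loaders.

```python
from collections import defaultdict

# Relation name -> set of (source, target) pairs. All IDs below are made up.
edges = defaultdict(set)

def add_undirected(relation, u, v):
    """Store an undirected edge once, with its endpoints in sorted order."""
    edges[relation].add(tuple(sorted((u, v))))

# Protein layer: protein-protein interactions.
add_undirected("protein-protein", "P1", "P2")

# Cross-layer links: which proteins a drug targets (kept directed here).
edges["drug-targets-protein"].add(("D1", "P1"))
edges["drug-targets-protein"].add(("D2", "P2"))

# Drug layer: one edge type per side effect.
add_undirected("side_effect:nausea", "D1", "D2")
add_undirected("side_effect:headache", "D2", "D1")

def side_effects(drug_a, drug_b):
    """List the side effect types recorded for an unordered drug pair."""
    pair = tuple(sorted((drug_a, drug_b)))
    return sorted(
        relation.split(":", 1)[1]
        for relation, pairs in edges.items()
        if relation.startswith("side_effect:") and pair in pairs
    )

observed = side_effects("D2", "D1")
```

In this framing, link prediction amounts to deciding, for a drug pair and a side effect type, whether the corresponding typed edge should exist.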
EmailEU is a directed temporal network constructed from email exchanges in a large European research institution over an 803-day period. It contains 986 email addresses as nodes and 332,334 emails as edges with timestamps. There are 42 ground truth departments in the dataset.
31 PAPERS • NO BENCHMARKS YET
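A directed temporal network such as EmailEU can be thought of as a list of timestamped edges. The sketch below assumes a `(sender, recipient, timestamp)` tuple layout and a hypothetical `snapshot` helper; neither is the dataset's actual file format or API.

```python
def snapshot(temporal_edges, start, end):
    """Return the set of directed (sender, recipient) pairs with start <= t < end."""
    return {(s, r) for s, r, t in temporal_edges if start <= t < end}

# Toy data: (sender, recipient, timestamp). Values are made up.
toy_emails = [(0, 1, 5), (1, 2, 12), (0, 1, 20), (2, 0, 31)]

early = snapshot(toy_emails, 0, 15)
late = snapshot(toy_emails, 15, 40)
```

Slicing by time window like this turns the edge stream into a sequence of static directed graphs, which is one common way to work with temporal networks.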
OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification.
31 PAPERS • 3 BENCHMARKS
The Pinterest dataset contains more than 1 million images associated with the Pinterest users who have “pinned” them.
CSL is a synthetic dataset introduced in Murphy et al. (2019) to test the expressivity of GNNs. In particular, graphs are isomorphic if they have the same degree and the task is to classify non-isomorphic graphs.
29 PAPERS • 2 BENCHMARKS
Abstract Meaning Representation (AMR) Annotation Release 2.0 was developed by the Linguistic Data Consortium (LDC), SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group and the Information Sciences Institute at the University of Southern California. It contains a sembank (semantic treebank) of over 39,260 English natural language sentences from broadcast conversations, newswire, weblogs and web discussion forums.
27 PAPERS • 2 BENCHMARKS
The dataset contains 33,010 molecule-description pairs split into 80%/10%/10% train/val/test sets. The goal of the task is to retrieve the relevant molecule for a natural language description.
24 PAPERS • 4 BENCHMARKS
Reddit12k contains 11,929 graphs, each corresponding to an online discussion thread in which nodes represent users and an edge indicates that one of the two users responded to a comment by the other. Each of the 11,929 discussion graphs carries one of 11 graph labels, representing the category of the community.
24 PAPERS • NO BENCHMARKS YET
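The construction described for Reddit12k (users as nodes, an undirected edge when one user replied to the other) can be sketched as follows. The input format, the user IDs, and the choice to ignore self-replies are assumptions made for illustration, not details confirmed by the dataset.

```python
def thread_graph(replies):
    """Build an undirected adjacency map from (responder, original_author) pairs."""
    adjacency = {}
    for responder, author in replies:
        if responder == author:
            continue  # assumption: self-replies add no edge
        adjacency.setdefault(responder, set()).add(author)
        adjacency.setdefault(author, set()).add(responder)
    return adjacency

# Toy thread with made-up user IDs.
toy_replies = [("u1", "u2"), ("u3", "u2"), ("u2", "u1"), ("u3", "u3")]
graph = thread_graph(toy_replies)

# Each undirected edge appears in two neighbour sets, so halve the total.
num_edges = sum(len(neighbours) for neighbours in graph.values()) // 2
```

Repeated replies between the same two users collapse into a single edge because neighbours are stored in sets.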
Node classification on twitch-gamers.
23 PAPERS • 2 BENCHMARKS
Questions is an interaction graph of users of a question-answering website based on data provided by Yandex Q.
22 PAPERS • 1 BENCHMARK
Roman-empire is a word dependency graph based on the Roman Empire article from the English Wikipedia.
Abstract GENeration DAtaset (AGENDA) is a dataset of knowledge graphs paired with scientific abstracts. The dataset consists of 40k paper titles and abstracts from the Semantic Scholar Corpus taken from the proceedings of 12 top AI conferences.
19 PAPERS • 1 BENCHMARK
A social network of LastFM users which was collected from the public API in March 2020. Nodes are LastFM users from Asian countries and edges are mutual follower relationships between them. The vertex features are extracted based on the artists liked by the users. The task related to the graph is multinomial node classification - one has to predict the location of users. This target feature was derived from the country field for each user.
19 PAPERS • NO BENCHMARKS YET
Node classification on Chameleon with the fixed 48%/32%/20% splits provided by Geom-GCN.
18 PAPERS • 2 BENCHMARKS
Node classification on Deezer Europe with 50%/25%/25% random splits for training/validation/test.
18 PAPERS • 1 BENCHMARK
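A random 50%/25%/25% node split of the kind mentioned above can be sketched with the standard library alone. The function name, the seed handling, and the rounding rule are assumptions; the benchmark's own split files may be generated differently.

```python
import random

def random_node_split(num_nodes, train_frac=0.5, val_frac=0.25, seed=0):
    """Shuffle node indices and cut them into train/validation/test parts."""
    indices = list(range(num_nodes))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = random_node_split(1000)
```

Passing `train_frac=0.6, val_frac=0.2` gives a 60%/20%/20% split in the same way. Fixed splits, by contrast, are read from files shipped with the benchmark rather than resampled.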
The Unified Medical Language System (UMLS) is a comprehensive resource that integrates and disseminates essential terminology, classification standards, and coding systems. Its purpose is to foster the creation of more effective and interoperable biomedical information systems and services, including electronic health records.
The Yeast dataset consists of a protein-protein interaction network. Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology.
18 PAPERS • NO BENCHMARKS YET
Node classification on Film with 60%/20%/20% random splits for training/validation/test.
17 PAPERS • 1 BENCHMARK
Node classification on PubMed with 60%/20%/20% random splits for training/validation/test.
Node classification on Squirrel with the fixed 48%/32%/20% splits provided by Geom-GCN.
17 PAPERS • 2 BENCHMARKS
Node classification on Squirrel with 60%/20%/20% random splits for training/validation/test.
Node classification on Wisconsin with 60%/20%/20% random splits for training/validation/test.
Tolokers is a network of crowdsourcing platform workers based on data provided by Toloka.
Node classification on Cornell with the fixed 48%/32%/20% splits provided by Geom-GCN.
16 PAPERS • 2 BENCHMARKS
Node classification on Cornell with 60%/20%/20% random splits for training/validation/test.
Node classification on Texas with 60%/20%/20% random splits for training/validation/test.
16 PAPERS • 1 BENCHMARK
Minesweeper is a synthetic graph emulating the eponymous game.
The Argoverse 2 Motion Forecasting Dataset is a curated collection of 250,000 scenarios for training and validation. Each scenario is 11 seconds long and contains the 2D, bird's-eye-view centroid and heading of each tracked object sampled at 10 Hz.
15 PAPERS • NO BENCHMARKS YET
Node classification on Chameleon with 60%/20%/20% random splits for training/validation/test.
15 PAPERS • 1 BENCHMARK
Node classification on Citeseer with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on Cora with the fixed 48%/32%/20% splits provided by Geom-GCN.
Node classification on PubMed with the fixed 48%/32%/20% splits provided by Geom-GCN.
This dataset is a Wikipedia dump, split by relations to perform Few-Shot Knowledge Graph Completion.
Node classification on Wisconsin with the fixed 48%/32%/20% splits provided by Geom-GCN.
15 PAPERS • 2 BENCHMARKS
BeerAdvocate is a dataset of beer reviews from the BeerAdvocate website. The data span a period of more than 10 years, including all ~1.5 million reviews up to November 2011. Each review includes ratings in terms of five "aspects": appearance, aroma, palate, taste, and overall impression. Reviews include product and user information, followed by each of these five ratings, and a plaintext review.
14 PAPERS • 1 BENCHMARK
Node classification on Film with the fixed 48%/32%/20% splits provided by Geom-GCN.
14 PAPERS • 2 BENCHMARKS
Node classification on Texas with the fixed 48%/32%/20% splits provided by Geom-GCN.
Amazon-ratings is a product co-purchasing network based on data from the SNAP datasets.
13 PAPERS • 1 BENCHMARK
The LINUX dataset consists of 48,747 Program Dependence Graphs (PDGs) generated from the Linux kernel. Each graph represents a function, where a node represents one statement and an edge represents a dependency between two statements.
13 PAPERS • NO BENCHMARKS YET
MalNet is a large public graph database, representing a large-scale ontology of software function call graphs. MalNet contains over 1.2 million graphs, averaging over 17k nodes and 39k edges per graph, across a hierarchy of 47 types and 696 families.
13 PAPERS • 4 BENCHMARKS
The BotNet dataset is a set of topological botnet detection datasets for graph neural networks.
12 PAPERS • NO BENCHMARKS YET
Provides detailed, graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes that capture who is present in the clip, their emotional and physical attributes, their relationships (e.g., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and with reasons that give motivations for actions.