The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.
1,093 PAPERS • 24 BENCHMARKS
The Open Graph Benchmark (OGB) is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner. OGB is a community-driven initiative in active development.
842 PAPERS • 16 BENCHMARKS
The FB15k dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. It has a total of 592,213 triplets with 14,951 entities and 1,345 relationships. FB15K-237 is a variant of the original dataset where inverse relations are removed, since it was found that a large number of test triplets could be obtained by inverting triplets in the training set.
573 PAPERS • 9 BENCHMARKS
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
510 PAPERS • 20 BENCHMARKS
The WN18 dataset has 18 relations scraped from WordNet for roughly 41,000 synsets, resulting in 141,442 triplets. It was found out that a large number of the test triplets can be found in the training set with another relation or the inverse relation. Therefore, a new version of the dataset WN18RR has been proposed to address this issue.
435 PAPERS • 5 BENCHMARKS
FB15k-237 is a link prediction dataset created from FB15k. While FB15k consists of 1,345 relations, 14,951 entities, and 592,213 triples, many triples are inverses that cause leakage from the training to testing and validation splits. FB15k-237 was created by Toutanova and Chen (2015) to ensure that the testing and evaluation datasets do not have inverse relation test leakage. In summary, FB15k-237 dataset contains 310,116 triples with 14,541 entities and 237 relation types.
409 PAPERS • 3 BENCHMARKS
WN18RR is a link prediction dataset created from WN18, which is a subset of WordNet. WN18 consists of 18 relations and 40,943 entities. However, many text triples are obtained by inverting triples from the training set. Thus the WN18RR dataset is created to ensure that the evaluation dataset does not have inverse relation test leakage. In summary, WN18RR dataset contains 93,003 triples with 40,943 entities and 11 relation types.
343 PAPERS • 3 BENCHMARKS
Yet Another Great Ontology (YAGO) is a Knowledge Graph that augments WordNet with common knowledge facts extracted from Wikipedia, converting WordNet from a primarily linguistic resource to a common knowledge base. YAGO originally consisted of more than 1 million entities and 5 million facts describing relationships between these entities. YAGO2 grounded entities, facts, and events in time and space, contained 446 million facts about 9.8 million entities, while YAGO3 added about 1 million more entities from non-English Wikipedia articles. YAGO3-10 a subset of YAGO3, containing entities which have a minimum of 10 relations each.
341 PAPERS • 7 BENCHMARKS
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
327 PAPERS • 14 BENCHMARKS
COLLAB is a scientific collaboration dataset. A graph corresponds to a researcher’s ego network, i.e., the researcher and its collaborators are nodes and an edge indicates collaboration between two researchers. A researcher’s ego network has three possible labels, i.e., High Energy Physics, Condensed Matter Physics, and Astro Physics, which are the fields that the researcher belongs to. The dataset has 5,000 graphs and each graph has label 0, 1, or 2.
233 PAPERS • 2 BENCHMARKS
The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with abstract, authors, year, venue, and title. The data set can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.
209 PAPERS • 5 BENCHMARKS
SNAP is a collection of large network datasets. It includes graphs representing social networks, citation networks, web graphs, online communities, online reviews and more.
154 PAPERS • NO BENCHMARKS YET
Orkut is a social network dataset consisting of friendship social network and ground-truth communities from Orkut.com on-line social network where users form friendship each other.
79 PAPERS • NO BENCHMARKS YET
The AMiner Dataset is a collection of different relational datasets. It consists of a set of relational networks such as citation networks, academic social networks or topic-paper-autor networks among others.
73 PAPERS • 1 BENCHMARK
Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes corresponding to branches of computer science, with very high connectivity. The node features are derived from the text of the corresponding articles. They were calculated as the average of pretrained GloVe word embeddings (Pennington et al., 2014), resulting in 300-dimensional node features.
73 PAPERS • 2 BENCHMARKS
The Long Range Graph Benchmark (LRGB) is a collection of 5 graph learning datasets that arguably require long-range reasoning to achieve strong performance in a given task. The 5 datasets in this benchmark can be used to prototype new models that can capture long range dependencies in graphs.
49 PAPERS • 5 BENCHMARKS
MMKG is a collection of three knowledge graphs for link prediction and entity matching research. Contrary to other knowledge graph datasets, these knowledge graphs contain both numerical features and images for all entities as well as entity alignments between pairs of KGs. While MMKG is intended to perform relational reasoning across different entities and images, previous resources are intended to perform visual reasoning within the same image.
44 PAPERS • 5 BENCHMARKS
STRING is a collection of protein-protein interaction (PPI) networks.
34 PAPERS • NO BENCHMARKS YET
BioGRID is a biomedical interaction repository with data compiled through comprehensive curation efforts. The current index is version 4.2.192 and searches 75,868 publications for 1,997,840 protein and genetic interactions, 29,093 chemical interactions and 959,750 post translational modifications from major model organism species.
33 PAPERS • 2 BENCHMARKS
Bio-decagon is a dataset for polypharmacy side effect identification problem framed as a multirelational link prediction problem in a two-layer multimodal graph/network of two node types: drugs and proteins. Protein-protein interaction network describes relationships between proteins. Drug-drug interaction network contains 964 different types of edges (one for each side effect type) and describes which drug pairs lead to which side effects. Lastly, drug-protein links describe the proteins targeted by a given drug.
31 PAPERS • 1 BENCHMARK
EmailEU is a directed temporal network constructed from email exchanges in a large European research institution for a 803-day period. It contains 986 email addresses as nodes and 332,334 emails as edges with timestamps. There are 42 ground truth departments in the dataset.
31 PAPERS • NO BENCHMARKS YET
OGB Large-Scale Challenge (OGB-LSC) is a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification.
31 PAPERS • 3 BENCHMARKS
The Unified Medical Language System (UMLS) is a comprehensive resource that integrates and disseminates essential terminology, classification standards, and coding systems. Its purpose is to foster the creation of more effective and interoperable biomedical information systems and services, including electronic health records. Here are the key aspects of the UMLS:
18 PAPERS • 1 BENCHMARK
Arxiv ASTRO-PH (Astro Physics) collaboration network is from the e-print arXiv and covers scientific collaborations between authors papers submitted to Astro Physics category. If an author i co-authored a paper with author j, the graph contains a undirected edge from i to j. If the paper is co-authored by k authors this generates a completely connected (sub)graph on k nodes.
10 PAPERS • 2 BENCHMARKS
This is a benchmark set for Traveling salesman problem (TSP) with characteristics that are different from the existing benchmark sets. In particular, it focuses on small instances which prove to be challenging for one or more state-of-the-art TSP algorithms. These instances are based on difficult instances of Hamiltonian cycle problem (HCP). This includes instances from literature, specially modified randomly generated instances, and instances arising from the conversion of other difficult problems to HCP.
5 PAPERS • 1 BENCHMARK
The ACM dataset contains papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM, and VLDB and are divided into three classes (Database, Wireless Communication, Data Mining). An heterogeneous graph is constructed, which comprises 3025 papers, 5835 authors, and 56 subjects. Paper features correspond to elements of a bag-of-words represented of keywords.
4 PAPERS • 1 BENCHMARK
The IS-A dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are related by the “is a” relation. For example, ‘acute leukemia’ is a ‘leukemia’. The dataset has 294,693 nodes with 356,541 edges between them.
4 PAPERS • NO BENCHMARKS YET
KG20C is a Knowledge Graph about high quality papers from 20 top computer science Conferences. It can serve as a standard benchmark dataset in scholarly data analysis for several tasks, including knowledge graph embedding, link prediction, recommendation systems, and question answering .
ProteinKG25 is a large-scale KG dataset with aligned descriptions and protein sequences respectively to GO terms and proteins entities. ProteinKG25 contains 4,990,097 triplets (4,879,951 Protein-GO triplets and 110,146 GO-GO triplets), 612,483 entities (565,254 proteins and 47,229 GO terms) and 31 relations.
Arxiv GR-QC (General Relativity and Quantum Cosmology) collaboration network is from the e-print arXiv and covers scientific collaborations between authors papers submitted to General Relativity and Quantum Cosmology category. If an author i co-authored a paper with author j, the graph contains a undirected edge from i to j. If the paper is co-authored by k authors this generates a completely connected (sub)graph on k nodes.
3 PAPERS • 2 BENCHMARKS
The FB15k-237-low dataset is a variation of the FB15k-237 dataset where relations with a low number of triplets are kept.
3 PAPERS • NO BENCHMARKS YET
The Nations dataset is a small knowledge graph with 14 entities, 55 relations, and 1992 triples describing countries and their political relationships. This dataset is available for download from https://github.com/ZhenfengLei/KGDatasets.
2 PAPERS • NO BENCHMARKS YET
The FB1.5M dataset is a benchmark for Knowledge Graph Completion. It is based on Freebase and it contains 30 relations with less than 500 triplets as low-resource relations.
1 PAPER • NO BENCHMARKS YET
GO21 is a biomedical knowledge graph that models genes, proteins, drugs, and the hierarchy of the biological processes they participate in. It consists of 806,136 triples with 21 relations and 89127 entities. GO21 can be used for knowledge graph completion tasks (link prediction) as well as hierarchical reasoning tasks, such as ancestor-descendant prediction task proposed in the paper.
1 PAPER • 1 BENCHMARK
Hypertention Disease Medication dataset.
The dataset contains constructed multi-modal features (visual and textual), pseudo-labels (on heritage values and attributes), and graph structures (with temporal, social, and spatial links) constructed using User-Generated Content data collected from Flickr social media platform in three global cities containing UNESCO World Heritage property (Amsterdam, Suzhou, Venice). The motivation of data collection in this project is to provide datasets that could be both directly applicable for ML communities as test-bed, and theoretically informative for heritage and urban scholars to draw conclusions on for planning decision-making.
This dataset contains information on application install interactions of users in the Myket android application market. The dataset was created for the purpose of evaluating interaction prediction models, requiring user and item identifiers along with timestamps of the interactions. Hence, the dataset can be used for interaction prediction and building a recommendation system. Furthermore, the data forms a dynamic network of interactions, and we can also perform network representation learning on the nodes in the network, which are users and applications.
The PART-OF dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are parts of the human body. The dataset has 16,894 nodes with 19,436 edges between them.