The PubMed dataset consists of 19717 scientific publications from PubMed database pertaining to diabetes classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector from a dictionary which consists of 500 unique words.
1,093 PAPERS • 24 BENCHMARKS
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
510 PAPERS • 20 BENCHMARKS
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
327 PAPERS • 14 BENCHMARKS
The DBLP is a citation network dataset. The citation data is extracted from DBLP, ACM, MAG (Microsoft Academic Graph), and other sources. The first version contains 629,814 papers and 632,752 citations. Each paper is associated with abstract, authors, year, venue, and title. The data set can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.
209 PAPERS • 5 BENCHMARKS
SNAP is a collection of large network datasets. It includes graphs representing social networks, citation networks, web graphs, online communities, online reviews and more.
154 PAPERS • NO BENCHMARKS YET
Orkut is a social network dataset consisting of friendship social network and ground-truth communities from Orkut.com on-line social network where users form friendship each other.
79 PAPERS • NO BENCHMARKS YET
EmailEU is a directed temporal network constructed from email exchanges in a large European research institution for a 803-day period. It contains 986 email addresses as nodes and 332,334 emails as edges with timestamps. There are 42 ground truth departments in the dataset.
31 PAPERS • NO BENCHMARKS YET
Yeast dataset consists of a protein-protein interaction network. Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology.
18 PAPERS • NO BENCHMARKS YET
Placenta is a benchmark dataset for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in placenta histology whole slide images. Cell graphs are large (>1 million nodes per image), node features are varied (64-dimensions of 11 types of cells), class labels are imbalanced (9 classes ranging from 0.21% of the data to 40.0%), and cellular communities cluster into heterogeneously distributed tissues of widely varying sizes (from 11 nodes to 44,671 nodes for a single structure).
2 PAPERS • 1 BENCHMARK