7 dataset results for Topic Models AND Texts

AG News

AG News (AG’s News Corpus) is a subset of AG’s corpus of news articles, constructed by assembling the title and description fields of articles from the 4 largest classes (“World”, “Sports”, “Business”, “Sci/Tech”) of AG’s Corpus. AG News contains 30,000 training and 1,900 test samples per class.

806 PAPERS • 10 BENCHMARKS
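
The per-class figures in the AG News description imply the overall split sizes. A minimal arithmetic sketch (the variable names are illustrative, not part of any official loader):

```python
# Per-class counts as stated in the AG News description.
NUM_CLASSES = 4            # World, Sports, Business, Sci/Tech
TRAIN_PER_CLASS = 30_000
TEST_PER_CLASS = 1_900

# Overall split sizes follow by multiplying by the number of classes.
train_total = NUM_CLASSES * TRAIN_PER_CLASS
test_total = NUM_CLASSES * TEST_PER_CLASS

print(f"train: {train_total:,}  test: {test_total:,}")
```

This gives 120,000 training and 7,600 test samples in total.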

New York Times Annotated Corpus

The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com. The corpus includes:

267 PAPERS • 8 BENCHMARKS

20 Newsgroups

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

26 PAPERS • 6 BENCHMARKS

OpoSum

OPOSUM is a dataset for training and evaluating opinion summarization models. It contains Amazon reviews from six product domains: Laptop Bags, Bluetooth Headsets, Boots, Keyboards, Televisions, and Vacuums. The six training collections were created by downsampling the Amazon Product Dataset introduced in McAuley et al. (2015) and contain reviews with their respective ratings.

8 PAPERS • NO BENCHMARKS YET

Mapping Topics in 100,000 Real-Life Moral Dilemmas

Mapping Topics in 100,000 Real-Life Moral Dilemmas (Tuan Dung Nguyen)

This dataset accompanies the ICWSM 2022 paper "Mapping Topics in 100,000 Real-Life Moral Dilemmas".

1 PAPER • NO BENCHMARKS YET

OAGT

OAGT (Paper Topic Dataset)

OAGT is a paper topic dataset consisting of 6,942,930 records, each comprising scientific publication attributes such as abstracts, titles, keywords, publication years, venues, etc. The last two fields of each record are the topic ID, drawn from a taxonomy of 27 topics created from the entire collection, and the 20 most significant topic words. Each dataset record (sample) is stored as a JSON line in the text file.

1 PAPER • NO BENCHMARKS YET
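
Because each OAGT record is stored as one JSON line, the file can be streamed record by record rather than loaded whole, which matters at roughly 6.9 million records. The sketch below is only an illustration: the field names `title`, `topic_id`, and `topic_words` are assumptions, and the three in-memory sample records are made up to stand in for the real file.

```python
import io
import json
from collections import Counter

# Made-up sample standing in for the real text file.
# The keys "title", "topic_id", and "topic_words" are hypothetical;
# check the actual dataset for the real field names.
SAMPLE = io.StringIO(
    '{"title": "Paper A", "topic_id": 3, "topic_words": ["graph", "network"]}\n'
    '{"title": "Paper B", "topic_id": 3, "topic_words": ["graph", "node"]}\n'
    '{"title": "Paper C", "topic_id": 11, "topic_words": ["protein", "cell"]}\n'
)


def iter_records(lines):
    """Yield one dict per non-empty JSON line, without reading everything at once."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_topics(lines, topic_key="topic_id"):
    """Count how many records fall under each topic ID."""
    return Counter(record[topic_key] for record in iter_records(lines))


topic_counts = count_topics(SAMPLE)
print(topic_counts.most_common())
```

For the real data, the same functions could be passed an open file handle in place of `SAMPLE`, since a text file object also iterates line by line.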

Reddit Ideology Database

A dataset of news articles posted in the r/Liberal and r/Conservative subreddits. In total, we collected a corpus of 226,010 articles in order to study political expression through the news articles that users share.

1 PAPER • 1 BENCHMARK
