14 dataset results for Fake News Detection AND English

LIAR is a publicly available dataset for fake news detection. It comprises 12.8K manually labeled short statements, spanning a decade, collected in various contexts from POLITIFACT.COM, which provides a detailed analysis report and links to source documents for each case. The dataset can also be used for fact-checking research. Notably, it is an order of magnitude larger than the previously largest public fake news datasets of similar type. The statements were obtained through POLITIFACT.COM's API, and each one is evaluated by a POLITIFACT.COM editor for its truthfulness.

108 PAPERS • 1 BENCHMARK

FNC-1 (Fake News Challenge Stage 1)

FNC-1 was designed as a stance detection dataset and contains 75,385 labeled headline and article pairs. Each pair is labelled as agree, disagree, discuss, or unrelated. Each headline in the dataset is phrased as a statement.

18 PAPERS • 2 BENCHMARKS
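
The FNC-1 pair structure described above can be sketched in a few lines of Python. This is only an illustration: the class and field names (`Stance`, `HeadlineArticlePair`, `parse_pair`) are hypothetical and do not reflect the dataset's actual file layout. Only the four label names come from the description.

```python
from dataclasses import dataclass
from enum import Enum


class Stance(Enum):
    # The four stance labels named in the FNC-1 description.
    AGREE = "agree"
    DISAGREE = "disagree"
    DISCUSS = "discuss"
    UNRELATED = "unrelated"


@dataclass(frozen=True)
class HeadlineArticlePair:
    # Hypothetical record layout; field names are illustrative only.
    headline: str
    article_body: str
    stance: Stance


def parse_pair(headline: str, article_body: str, label: str) -> HeadlineArticlePair:
    """Build a pair, raising ValueError for any label outside the four classes."""
    return HeadlineArticlePair(headline, article_body, Stance(label.strip().lower()))


# Made-up example text, not taken from the dataset.
example = parse_pair("Mayor opens new bridge", "The bridge opened on Monday.", "agree")
```

Validating labels at parse time keeps a stray or misspelled class from silently entering a stance classifier's training data.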

PolitiFact

A dataset of fact-checking (FC) articles that contains pairs (a multimodal tweet and an FC article) from politifact.com.

16 PAPERS • 1 BENCHMARK

COVID-19 Fake News Dataset (COVID19 Fake News Detection in English)

Along with the COVID-19 pandemic, we are also fighting an "infodemic". Fake news and rumors are rampant on social media. Believing in rumors can cause significant harm, and this is further exacerbated during a pandemic. To tackle this, we curate and release a manually annotated dataset of 10,700 social media posts and articles of real and fake news on COVID-19. We benchmark the annotated dataset with four machine learning baselines: Decision Tree, Logistic Regression, Gradient Boost, and Support Vector Machine (SVM). We obtain the best performance of 93.46% F1-score with SVM.

11 PAPERS • 1 BENCHMARK
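
For readers unfamiliar with the F1-score reported above, the following is a minimal sketch of how a binary F1 is computed from predictions. It is an assumption that a single-class (binary) F1 is the relevant variant; the dataset authors may have used a weighted or averaged form. The function name, the label strings, and the toy data are hypothetical.

```python
def f1_score(y_true, y_pred, positive="fake"):
    """Binary F1 for the `positive` class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only; not drawn from the dataset.
y_true = ["fake", "fake", "real", "real", "fake"]
y_pred = ["fake", "real", "real", "fake", "fake"]
score = f1_score(y_true, y_pred)  # tp=2, fp=1, fn=1, so precision = recall = 2/3
```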

MM-COVID (Multilingual and Multidimensional COVID-19 Fake News Data Repository)

MM-COVID is a dataset for fake news detection related to COVID-19. It provides multilingual fake news and the relevant social context. It contains 3,981 pieces of fake news content and 7,192 pieces of trustworthy information in six languages: English, Spanish, Portuguese, Hindi, French, and Italian.

8 PAPERS • NO BENCHMARKS YET

NELA-GT-2018

NELA-GT-2018 is a dataset for the study of misinformation that consists of 713k articles collected between 02/2018 and 11/2018. These articles are collected directly from 194 news and media outlets, including mainstream, hyper-partisan, and conspiracy sources. It includes ground truth ratings of the sources from 8 different assessment sites covering multiple dimensions of veracity, including reliability, bias, transparency, adherence to journalistic standards, and consumer trust.

8 PAPERS • NO BENCHMARKS YET
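
Because the NELA-GT ground truth is attached to sources, not to individual articles, a common way to use it is to propagate each source's rating to its articles. The sketch below illustrates that idea only. The record layout, the `label_articles` helper, the outlet names, and the "reliable"/"unreliable" values are all hypothetical and do not reflect the dataset's actual schema.

```python
from collections import Counter


def label_articles(articles, source_labels):
    """Attach each article's source-level label; None when the source is unrated."""
    return [
        {**article, "label": source_labels.get(article["source"])}
        for article in articles
    ]


# Hypothetical source-level ratings and article records for illustration.
source_labels = {"outlet_a": "reliable", "outlet_b": "unreliable"}
articles = [
    {"id": 1, "source": "outlet_a", "title": "First story"},
    {"id": 2, "source": "outlet_b", "title": "Second story"},
    {"id": 3, "source": "outlet_c", "title": "Third story"},  # unrated source
]

labeled = label_articles(articles, source_labels)
label_counts = Counter(item["label"] for item in labeled)
```

Labels inherited this way are weak labels: an article from an unreliable source is not necessarily false, so any evaluation built on them measures source reliability, not per-article veracity.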

UPFD (User Preference-aware Fake News Detection)

For benchmarking, please refer to its variants UPFD-POL and UPFD-GOS.

8 PAPERS • 2 BENCHMARKS

NELA-GT-2019

NELA-GT-2019 is an updated version of the NELA-GT-2018 dataset. NELA-GT-2019 contains 1.12M news articles from 260 sources collected between January 1st, 2019 and December 31st, 2019. Just as with NELA-GT-2018, these sources come from a wide range of mainstream and alternative news sources. Included with the dataset are source-level ground truth labels from 7 different assessment sites covering multiple dimensions of veracity.

5 PAPERS • NO BENCHMARKS YET

MuMiN

MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags). It spans 21 million tweets belonging to 26 thousand Twitter threads, each of which has been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, covering more than a decade.

4 PAPERS • 3 BENCHMARKS

NELA-GT-2020

NELA-GT-2020 is an updated version of the NELA-GT-2019 dataset. NELA-GT-2020 contains nearly 1.8M news articles from 519 sources collected between January 1st, 2020 and December 31st, 2020. Just as with NELA-GT-2018 and NELA-GT-2019, these sources come from a wide range of mainstream news sources and alternative news sources. Included in the dataset are source-level ground truth labels from Media Bias/Fact Check (MBFC) covering multiple dimensions of veracity. Additionally, new in the 2020 dataset are the Tweets embedded in the collected news articles, adding an extra layer of information to the data.

4 PAPERS • NO BENCHMARKS YET

UPFD-GOS (User Preference-aware Fake News Detection)

The Gossipcop variant of the UPFD dataset for benchmarking.

3 PAPERS • 1 BENCHMARK

UPFD-POL (User Preference-aware Fake News Detection)

The PolitiFact variant of the UPFD dataset for benchmarking.

2 PAPERS • 1 BENCHMARK

Twitter MediaEval (MediaEval Benchmarking Initiative for Multimedia Evaluation)

The task addresses the problem of the appearance and propagation of posts that share misleading multimedia content (images or video). In the context of the task, different types of misleading use are considered:

1 PAPER • NO BENCHMARKS YET

CIDII Dataset (Correct Information and Disinformation about Islamic Issues)

The CIDII dataset is a binary classification dataset consisting of two classes, correct information and disinformation, related to Islamic issues. The CIDII dataset belongs to our research (DISINFORMATION DETECTION ABOUT ISLAMIC ISSUES ON SOCIAL MEDIA USING DEEP LEARNING TECHNIQUES), published in the MJCS journal at the following link: https://ejournal.um.edu.my/index.php/MJCS/article/view/41935

0 PAPERS • NO BENCHMARKS YET
