LogiQA consists of 8,678 QA instances, covering multiple types of deductive reasoning. Results show that state-of-the-art neural models perform by far worse than human ceiling. The dataset can also serve as a benchmark for reinvestigating logical AI under the deep learning NLP setting.
76 PAPERS • NO BENCHMARKS YET
PearRead is a dataset of scientific peer reviews. The dataset consists of over 14K paper drafts and the corresponding accept/reject decisions in top-tier venues including ACL, NIPS and ICLR, as well as over 10K textual peer reviews written by experts for a subset of the papers.
33 PAPERS • NO BENCHMARKS YET
Evidence Inference is a corpus for this task comprising 10,000+ prompts coupled with full-text articles describing RCTs.
26 PAPERS • NO BENCHMARKS YET
Engine degradation simulation was carried out using C-MAPSS. Four different were sets simulated under different combinations of operational conditions and fault modes. Records several sensor channels to characterize fault evolution. The data set was provided by the Prognostics CoE at NASA Ames.
6 PAPERS • 3 BENCHMARKS
A View From Somewhere (AVFS)—a dataset of 638,180 face similarity judgments over 4,921 faces. Each judgment corresponds to the odd-one-out (i.e., least similar) face in a triplet of faces and is accompanied by both the identifier and demographic attributes of the annotator who made the judgment.
1 PAPER • NO BENCHMARKS YET
This is a real-world industrial benchmark dataset from a major medical device manufacturer for the prediction of customer escalations. The dataset contains features derived from IoT (machine log) and enterprise data including labels for escalation from a fleet of thousands of customers of high-end medical devices.
Multi-agent pursuit in matrix world (pursuitMW) is a partially observable Markov game (POMG) between a swarm of pursuers and a swarm of evaders. Algorithms can be developed for the pursuers, evaders, or both of them.