The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. BIG-bench includes more than 200 tasks.
238 PAPERS • 134 BENCHMARKS
We generate epistemic reasoning problems using modal logic to target theory of mind (ToM) in natural language processing models.
3 PAPERS • 1 BENCHMARK
This dataset tests the ability of language models to correctly capture the meaning of words denoting probabilities (WEP, also called verbal probabilities), e.g., "probably", "maybe", "surely", and "impossible".
1 PAPER • NO BENCHMARKS YET