Vibe-Eval

Introduced by Padlewski et al. in Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models

Vibe-Eval is an open benchmark and framework for evaluating multimodal chat models¹². It was introduced by Reka Technologies⁴ and is designed to rigorously test these models' visual understanding capabilities⁴. Key points:

It consists of 269 ultra high-quality image-text prompts and their ground truth responses¹.
The prompts and responses have been checked multiple times by the Reka team¹.
Vibe-Eval is designed to be difficult, challenging even to the current frontier models, and to induce greater separability among frontier-class models¹.
On 50% of the hard set, all frontier models fail to arrive at a perfect answer, leaving a lot of headroom for progress¹.
The prompts are created by AI experts who are closely familiar with the performance of frontier models¹.
While MMMU has been a solid standard for evaluating multimodal models, it is fundamentally a multiple-choice benchmark¹. Vibe-Eval, by contrast, is an open-ended evaluation setup¹.
The authors also discuss the challenges and trade-offs between human and model-based automatic evaluation, and propose a lightweight automatic evaluation protocol based on Reka Core¹.
The authors plan to periodically run formal human evaluations on public models that do well on this benchmark¹.
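
The model-based evaluation idea above can be sketched roughly as follows. This is a minimal illustration, not the released Vibe-Eval evaluator: the names Example, Judge, mean_judge_score and exact_match_judge are hypothetical, and the 1 to 5 rating range and the rescaling to 0..100 are assumptions made for the example. In practice the judge would be a call to a strong model (such as Reka Core) that compares a generation against the ground-truth response; here a toy string-matching judge stands in so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    prompt: str       # text part of the image-text prompt
    reference: str    # ground-truth response
    generation: str   # output of the model being evaluated


# A judge maps (prompt, reference, generation) to an integer rating.
# Hypothetical interface: a real judge would query an evaluator model.
Judge = Callable[[str, str, str], int]


def mean_judge_score(
    examples: Iterable[Example], judge: Judge, lo: int = 1, hi: int = 5
) -> float:
    """Average the judge's ratings and rescale them from [lo, hi] to [0, 100]."""
    ratings = []
    for ex in examples:
        rating = judge(ex.prompt, ex.reference, ex.generation)
        if not lo <= rating <= hi:
            raise ValueError(f"rating {rating} outside [{lo}, {hi}]")
        ratings.append(rating)
    if not ratings:
        raise ValueError("no examples to score")
    mean = sum(ratings) / len(ratings)
    return 100.0 * (mean - lo) / (hi - lo)


def exact_match_judge(prompt: str, reference: str, generation: str) -> int:
    """Toy stand-in judge: top rating on a case-insensitive match, else lowest."""
    same = generation.strip().lower() == reference.strip().lower()
    return 5 if same else 1


if __name__ == "__main__":
    demo = [
        Example("What animal is shown?", "A cat", "a cat "),
        Example("How many chairs are visible?", "Two", "Three"),
    ]
    print(mean_judge_score(demo, exact_match_judge))
```

Keeping the judge as a plain callable makes it easy to swap the toy matcher for a model-backed judge, or for human ratings, without changing the aggregation step.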

(1) Vibe-Eval: A new open and hard evaluation suite for measuring progress .... https://www.reka.ai/news/vibe-eval
(2) Vibe-Eval: A hard evaluation suite for measuring progress of multimodal .... https://arxiv.org/pdf/2405.02287
(3) This AI Paper by Reka AI Introduces Vibe-Eval: A Comprehensive Suite .... https://www.marktechpost.com/2024/05/02/this-ai-paper-by-reka-ai-introduces-vibe-eval-a-comprehensive-suite-for-evaluating-ai-multimodal-models/
(4) Vibe-Eval: A hard evaluation suite for measuring progress of multimodal .... https://arxivtools.blob.core.windows.net/xueshuxiangzipaperhtml/2024_5_6/2405.02287.pdf
