Vibe-Eval is an open benchmark and framework for evaluating multimodal chat models¹². It was introduced by Reka AI⁴ and is designed to rigorously test these models' visual understanding capabilities⁴. Here are some key points about Vibe-Eval:
- It consists of 269 ultra-high-quality image-text prompts paired with ground-truth responses¹.
- The prompts and responses were checked multiple times by the Reka team¹.
- Vibe-Eval is designed to be difficult, challenging even for current frontier models, so as to induce greater separability among frontier-class models¹.
- On 50% of the hard set, no frontier model arrives at a perfect answer, leaving substantial headroom for progress¹.
- The prompts were written by AI experts deeply familiar with the performance of frontier models¹.
- While MMMU has been a solid standard for evaluating multimodal models, it is fundamentally a multiple-choice benchmark¹; Vibe-Eval, by contrast, uses an open-ended evaluation setup¹.
- The authors also discuss the challenges and trade-offs between human and model-based automatic evaluation, and propose a lightweight automatic evaluation protocol based on Reka Core¹.
- They plan to periodically run formal human evaluations on public models that do well on this benchmark¹.
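The model-based automatic evaluation mentioned above follows the general "LLM-as-judge" pattern: a strong judge model scores each candidate response against the ground-truth reference, and scores are averaged over the benchmark. The sketch below illustrates that pattern only; it is not Reka's actual protocol, and `judge_score` is a stand-in for a real API call to a judge model such as Reka Core.

```python
# Minimal sketch of a model-as-judge evaluation loop (hypothetical; not
# Reka's actual Vibe-Eval protocol or API).

from dataclasses import dataclass


@dataclass
class Example:
    prompt: str      # image-text prompt (image handling omitted here)
    reference: str   # ground-truth response


def judge_score(prompt: str, reference: str, candidate: str) -> float:
    """Rate a candidate response from 1 (worst) to 5 (best).

    A real implementation would format a grading rubric into a prompt,
    send it to the judge model, and parse the returned rating. As a
    self-contained placeholder, we score by word overlap with the
    reference.
    """
    ref_words = set(reference.lower().split())
    overlap = set(candidate.lower().split()) & ref_words
    return 1 + 4 * len(overlap) / max(len(ref_words), 1)


def evaluate(examples: list[Example], generate) -> float:
    """Average judge score of `generate(prompt)` over the benchmark."""
    scores = [
        judge_score(ex.prompt, ex.reference, generate(ex.prompt))
        for ex in examples
    ]
    return sum(scores) / len(scores)
```

A perfect model (one that reproduces every reference) would score 5.0 under this placeholder judge, and a completely off-target one would score 1.0; the interesting design questions, discussed in the paper, are how well such automatic scores track formal human evaluation.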
(1) Vibe-Eval: A new open and hard evaluation suite for measuring progress .... https://www.reka.ai/news/vibe-eval
(2) Vibe-Eval: A hard evaluation suite for measuring progress of multimodal .... https://arxiv.org/pdf/2405.02287
(3) This AI Paper by Reka AI Introduces Vibe-Eval: A Comprehensive Suite .... https://www.marktechpost.com/2024/05/02/this-ai-paper-by-reka-ai-introduces-vibe-eval-a-comprehensive-suite-for-evaluating-ai-multimodal-models/
(4) Vibe-Eval: A hard evaluation suite for measuring progress of multimodal .... https://arxivtools.blob.core.windows.net/xueshuxiangzipaperhtml/2024_5_6/2405.02287.pdf