Vibe-Eval is an open benchmark and framework for evaluating multimodal chat models¹². Introduced by Reka Technologies⁴, it is designed to rigorously test these models' visual understanding capabilities⁴. Key points about Vibe-Eval:

  • It consists of 269 high-quality image-text prompts, each paired with a ground-truth response¹.
  • The prompts and responses have been checked multiple times by the Reka team¹.
  • Vibe-Eval is designed to be difficult, challenging even for current frontier models, and to induce greater separability among frontier-class models¹.
  • On 50% of the hard set, all frontier models fail to arrive at a perfect answer, leaving a lot of headroom for progress¹.
  • The prompts are created by actual AI experts who have a strong familiarity with the performance of frontier models¹.
  • MMMU has been a solid standard for evaluating multimodal models, but it is fundamentally a multiple-choice benchmark¹. Vibe-Eval, by contrast, is an open-ended evaluation setup¹.
  • The authors discuss the challenges and trade-offs between human and model-based automatic evaluation, and propose a lightweight automatic evaluation protocol based on Reka Core¹.
  • The Reka team plans to periodically run formal human evaluations on public models that do well on this benchmark¹.
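
To make the open-ended, model-based evaluation described above more concrete, the following is a minimal sketch, not the authors' protocol. It assumes a judge that compares a model's response against the ground-truth response and returns an integer rating; the 1 to 5 scale, the names Example, Judge, mean_rating and all_fail_fraction, and the toy exact-match judge are all hypothetical stand-ins (a real setup would call an evaluator model such as Reka Core instead). It also shows how a statistic like "all models fail to reach a perfect answer on a share of prompts" could be computed from per-prompt ratings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Example:
    """One benchmark item: a prompt and its ground-truth response (image omitted)."""
    prompt: str
    reference: str


# Hypothetical judge signature: (prompt, reference, candidate) -> integer rating.
# A 1-5 scale is an assumption made for this sketch.
Judge = Callable[[str, str, str], int]


def toy_judge(prompt: str, reference: str, candidate: str) -> int:
    """Stand-in for a model-based judge: 5 on a case-insensitive exact match, else 1."""
    return 5 if candidate.strip().lower() == reference.strip().lower() else 1


def mean_rating(examples: List[Example], responses: List[str], judge: Judge) -> float:
    """Average judge rating of one model's responses over the given examples."""
    ratings = [judge(e.prompt, e.reference, r) for e, r in zip(examples, responses)]
    return sum(ratings) / len(ratings)


def all_fail_fraction(ratings_by_model: Dict[str, List[int]], perfect: int = 5) -> float:
    """Fraction of prompts on which no model reaches the perfect rating."""
    per_prompt = list(zip(*ratings_by_model.values()))
    failed = sum(1 for ratings in per_prompt if max(ratings) < perfect)
    return failed / len(per_prompt)


if __name__ == "__main__":
    examples = [Example("What is shown?", "a cat"), Example("How many dogs?", "two")]
    print(mean_rating(examples, ["A cat", "three"], toy_judge))
    print(all_fail_fraction({"model_a": [5, 1], "model_b": [1, 3]}))
```

Keeping the judge as a plain callable means a human rater, a heuristic, or an evaluator model can be swapped in without changing the aggregation code.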

(1) Vibe-Eval: A new open and hard evaluation suite for measuring progress ... https://www.reka.ai/news/vibe-eval
(2) Vibe-Eval: A hard evaluation suite for measuring progress of multimodal ... https://arxiv.org/pdf/2405.02287
(3) This AI Paper by Reka AI Introduces Vibe-Eval: A Comprehensive Suite ... https://www.marktechpost.com/2024/05/02/this-ai-paper-by-reka-ai-introduces-vibe-eval-a-comprehensive-suite-for-evaluating-ai-multimodal-models/
(4) Vibe-Eval: A hard evaluation suite for measuring progress of multimodal ... https://arxivtools.blob.core.windows.net/xueshuxiangzipaperhtml/2024_5_6/2405.02287.pdf

