Multi-Modal-CelebA-HQ is a large-scale face image dataset of 30,000 high-resolution face images selected from the CelebA dataset following CelebA-HQ. Each image is accompanied by a high-quality segmentation mask, a sketch, a descriptive text, and a version of the image with a transparent background.
27 PAPERS • 3 BENCHMARKS
Recent advances in large language models have led to the development of multimodal LLMs (MLLMs), which take both image data and text as input. Virtually all of these models have been announced within the past year, creating a significant need for benchmarks that evaluate their ability to reason truthfully and accurately across a diverse set of tasks. When Google announced Gemini (Gemini Team et al., 2023), they showcased its ability to solve rebuses—wordplay puzzles that involve creatively adding and subtracting letters from words derived from text and images. The diversity of rebuses allows for a broad evaluation of multimodal reasoning capabilities, including image recognition, multi-step reasoning, and understanding the human creator's intent. We present REBUS: a collection of 333 hand-crafted rebuses spanning 13 diverse categories, including hand-drawn and digital images created by nine contributors. Samples are presented in Table 1. Notably, GPT-4V, the most powerful
1 PAPER • 1 BENCHMARK