TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Video-based Generative Performance Benchmarking	VideoInstruct	VLM-RLAIF	Correctness of Information	3.63	# 1
Video-based Generative Performance Benchmarking	VideoInstruct	VLM-RLAIF	Detail Orientation	3.25	# 1
Video-based Generative Performance Benchmarking	VideoInstruct	VLM-RLAIF	Contextual Understanding	4	# 1
Video-based Generative Performance Benchmarking	VideoInstruct	VLM-RLAIF	Temporal Understanding	3.23	# 1
Video-based Generative Performance Benchmarking	VideoInstruct	VLM-RLAIF	Consistency	3.32	# 1
Video-based Generative Performance Benchmarking	VideoInstruct	VLM-RLAIF	mean	3.49	# 1
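
The "mean" row appears to be the unweighted arithmetic mean of the five per-criterion scores. The page does not state this, so treat it as an assumption; the sketch below only checks that the reported numbers are consistent with it. The variable names are illustrative.

```python
# Per-criterion VideoInstruct scores for VLM-RLAIF, copied from the table above.
scores = {
    "Correctness of Information": 3.63,
    "Detail Orientation": 3.25,
    "Contextual Understanding": 4.00,
    "Temporal Understanding": 3.23,
    "Consistency": 3.32,
}

# Assumption: the "mean" row is a plain average over the five criteria.
mean_score = sum(scores.values()) / len(scores)

print(f"{mean_score:.2f}")  # prints 3.49, matching the reported mean
```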

Badge	Markdown
	`[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/tuning-large-multimodal-models-for-videos/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=tuning-large-multimodal-models-for-videos)`

Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback

6 Feb 2024 · Daechul Ahn, Yura Choi, Youngjae Yu, Dongyeop Kang, Jonghyun Choi

Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs). Previous approaches for VLMMs involved Supervised Fine-Tuning (SFT) with instruction-tuned datasets, integrating LLMs with visual encoders, and adding learnable modules. Video and text multimodal alignment remains challenging, primarily due to the insufficient volume and quality of multimodal instruction-tuning data compared to text-only data. We present a novel alignment strategy, called Reinforcement Learning from AI Feedback (RLAIF), that employs a multimodal AI system to oversee itself: the system provides self-preference feedback to refine itself and to facilitate the alignment of video and text modalities. Specifically, we propose context-aware reward modeling, which provides detailed video descriptions as context during the generation of preference feedback in order to enrich the understanding of video content. Demonstrating enhanced performance across diverse video benchmarks, our multimodal RLAIF approach, VLM-RLAIF, outperforms existing approaches, including the SFT model. We commit to open-sourcing our code, models, and datasets to foster further research in this area.
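
As a rough illustration of the context-aware preference feedback described in the abstract, the sketch below shows one way a detailed video description could be placed in the prompt when a judge model picks between two candidate answers. This is not the authors' implementation: `build_judge_prompt`, `label_preference`, `PreferencePair`, the prompt wording, and the `toy_judge` stand-in are all hypothetical names introduced here, and the real judge would be a multimodal model rather than a stub.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    """One preference label: the preferred and the rejected answer (hypothetical type)."""
    question: str
    chosen: str
    rejected: str


def build_judge_prompt(context: str, question: str, answer_a: str, answer_b: str) -> str:
    # Assumption: the detailed video description is supplied as plain-text context
    # ahead of the question, so the judge can ground its preference in it.
    return (
        f"Video description (context):\n{context}\n\n"
        f"Question: {question}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer is better grounded in the video? Reply with 'A' or 'B'."
    )


def label_preference(
    judge: Callable[[str], str],
    context: str,
    question: str,
    answer_a: str,
    answer_b: str,
) -> PreferencePair:
    # Ask the judge and turn its verdict into a (chosen, rejected) pair.
    verdict = judge(build_judge_prompt(context, question, answer_a, answer_b)).strip().upper()
    if verdict.startswith("A"):
        return PreferencePair(question, chosen=answer_a, rejected=answer_b)
    if verdict.startswith("B"):
        return PreferencePair(question, chosen=answer_b, rejected=answer_a)
    raise ValueError(f"Unexpected judge verdict: {verdict!r}")


def toy_judge(prompt: str) -> str:
    # Hypothetical stand-in for a model call; always prefers answer B.
    return "B"


pair = label_preference(
    toy_judge,
    context="A person slices an onion, then adds it to a frying pan.",
    question="What does the person do after slicing the onion?",
    answer_a="They put the onion in the fridge.",
    answer_b="They add the onion to a frying pan.",
)
print(pair.chosen)
```

Pairs collected this way could then serve as training data for a reward model; how the paper actually trains and uses its reward model is not specified on this page.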

Code

yonseivnl/vlm-rlaif official

Tasks

Video-based Generative Performance Benchmarking

Datasets

VideoInstruct

Results from the Paper

Ranked #1 on Video-based Generative Performance Benchmarking on VideoInstruct

Methods

RLAIF • SFT
