CAT: enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios

This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, these responses are sometimes ambiguous and fail to describe s...

Bibliographic Details
Main Authors: Ye, Q; Yu, Z; Shao, R; Xie, X; Torr, P; Cao, X
Format: Conference item
Language: English
Published: IEEE 2024