CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
This paper focuses on the challenge of answering questions in scenarios composed of rich, complex, and dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, these responses are sometimes ambiguous and fail to describe s...
| Main Authors: | Ye, Q; Yu, Z; Shao, R; Xie, X; Torr, P; Cao, X |
|---|---|
| Format: | Conference item |
| Language: | English |
| Published: | Association for Computing Machinery, 2024 |
Similar Items
- Debiasing visual question and answering with answer preference
  by: Zhang, Xinye
  Published: (2020)
- Multimodal audio-visual emotion detection
  by: Chaudhary, Nitesh Kumar
  Published: (2021)
- Causalqa: a causal framework for question answering
  by: Dutta, Angshuk
  Published: (2022)
- Towards a hierarchical framework for predicting the best answer in a question answering system
  by: Chua, Alton Yeow Kuan, et al.
  Published: (2009)
- Clinical Question-Answering over Distributed EHR Data
  by: Jiang, Emily
  Published: (2024)