CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, these responses are sometimes ambiguous and fail to describe s...
Main Authors: | |
---|---|
Format: | Conference item |
Language: | English |
Published: | IEEE, 2024 |