A sound approach: using large language models to generate audio descriptions for egocentric text-audio retrieval

Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, th...

Bibliographic Details
Main Authors: Oncescu, A-M, Henriques, JF, Zisserman, A, Albanie, S, Koepke, AS
Format: Conference item
Language: English
Published: IEEE 2024