A sound approach: using large language models to generate audio descriptions for egocentric text-audio retrieval

Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, th...

Bibliographic details
Main authors: Oncescu, A-M, Henriques, JF, Zisserman, A, Albanie, S, Koepke, AS
Format: Conference item
Language: English
Published: IEEE 2024