A sound approach: using large language models to generate audio descriptions for egocentric text-audio retrieval

Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, th...

Bibliographic details
Main authors: Oncescu, A-M, Henriques, JF, Zisserman, A, Albanie, S, Koepke, AS
Format: Conference item
Language: English
Published: IEEE 2024