A sound approach: using large language models to generate audio descriptions for egocentric text-audio retrieval
Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, th...
Main Authors: | , , , , |
---|---|
Format: | Conference item |
Language: | English |
Published: | IEEE 2024 |