A sound approach: using large language models to generate audio descriptions for egocentric text-audio retrieval
Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, th...
Main Authors: | , , , , |
---|---|
Format: | Conference item |
Language: | English |
Published: | IEEE, 2024 |