Separating the “chirp” from the “chat”: self-supervised visual grounding of sound and language
We present DenseAV, a novel dual-encoder grounding architecture that learns high-resolution, semantically meaningful, audio-visually aligned features solely by watching videos. We show that DenseAV can discover the “meaning” of words and the “location” of sounds without explicit localization...
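The abstract describes a dual-encoder setup in which an audio branch and a visual branch are aligned through self-supervision. As a rough illustration only, the sketch below shows one generic way such dense audio and visual features could be compared and trained with a contrastive objective; the aggregation scheme (max over image patches, mean over time) and the InfoNCE-style loss are assumptions for illustration, not DenseAV's actual implementation.

```python
import torch
import torch.nn.functional as F

def clip_scores(audio_feats, visual_feats):
    """Cross-batch clip scores from dense audio/visual features (sketch).

    audio_feats:  (B, T, D)    per-time-step audio embeddings
    visual_feats: (B, H, W, D) per-patch visual embeddings
    Returns a (B, B) matrix of audio-vs-visual clip similarities.
    """
    a = F.normalize(audio_feats, dim=-1)
    v = F.normalize(visual_feats, dim=-1).flatten(1, 2)   # (B, HW, D)
    # Dense similarity volume: every audio step vs every patch of every clip.
    sim = torch.einsum("atd,vpd->avtp", a, v)             # (B, B, T, HW)
    # Aggregate: best-matching patch per time step, then average over time.
    return sim.max(dim=-1).values.mean(dim=-1)             # (B, B)

def contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """Symmetric InfoNCE over paired clips in a batch (illustrative only)."""
    scores = clip_scores(audio_feats, visual_feats) / temperature
    targets = torch.arange(scores.shape[0], device=scores.device)
    return 0.5 * (F.cross_entropy(scores, targets) +
                  F.cross_entropy(scores.t(), targets))
```

Because the similarity is built from inner products between individual audio steps and individual image patches, the same dense volume that drives the clip-level loss can also be read out for localization, which is the intuition behind grounding without explicit supervision.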
Main Authors: | , , , |
---|---|
Format: | Conference item |
Language: | English |
Published: | IEEE, 2024 |