Separating the “chirp” from the “chat”: self-supervised visual grounding of sound and language

We present DenseAV, a novel dual-encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely by watching videos. We show that DenseAV can discover the “meaning” of words and the “location” of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn “global” audio and video representations cannot localize words and sounds. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech- and sound-prompted semantic segmentation. On these and other datasets, we show that DenseAV dramatically outperforms the prior art on speech- and sound-prompted semantic segmentation. DenseAV also outperforms the current state of the art, ImageBind, on cross-modal retrieval while using fewer than half as many parameters. Project page: https://aka.ms/denseav
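
The abstract attributes DenseAV's localization ability to a multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. Below is a minimal PyTorch sketch of one way such an operator could pool a dense audio-visual similarity volume into a single clip-level score; the function name, tensor shapes, and the max-over-space, mean-over-time-and-heads pooling are illustrative assumptions, not the paper's exact implementation.

    import torch

    def multihead_dense_similarity(audio_feats, image_feats, n_heads):
        # Hypothetical sketch, not DenseAV's actual code.
        # audio_feats: (B, C, T)    dense audio features over T time steps
        # image_feats: (B, C, H, W) dense image features over an H x W grid
        B, C, T = audio_feats.shape
        _, _, H, W = image_feats.shape
        d = C // n_heads  # split channels evenly across heads
        a = audio_feats.view(B, n_heads, d, T)
        v = image_feats.view(B, n_heads, d, H * W)
        # Full similarity volume: inner product between every audio time
        # step and every image location, per head -> (B, n_heads, T, H*W).
        sim = torch.einsum('bhdt,bhds->bhts', a, v)
        # Pool to one score per audio-image pair: max over image locations
        # keeps the best-matching region, then average over time and heads.
        # (This pooling choice is an assumption for illustration.)
        return sim.amax(dim=-1).mean(dim=(-1, -2))  # (B,)

    # Example: 4 clips, 512 channels, 2 heads, 50 audio steps, 14x14 grid.
    scores = multihead_dense_similarity(
        torch.randn(4, 512, 50), torch.randn(4, 512, 14, 14), n_heads=2)

In a contrastive setup, such scores for matched versus mismatched audio-video pairs would feed an InfoNCE-style loss; because the score is built from a dense similarity volume rather than from two pooled “global” vectors, the volume itself can later be read out as a localization map, which is the property the abstract highlights.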

Full description

Bibliographic Details
Main Authors: Hamilton, M; Zisserman, A; Hershey, JR; Freeman, WT
Format: Conference item
Language: English
Published: IEEE, 2024
Institution: University of Oxford
Record ID: oxford-uuid:2d4a16f0-f8b7-43b1-ab5a-ee996814d2d0