Separating the “chirp” from the “chat”: self-supervised visual grounding of sound and language

We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visual aligned features solely through watching videos. We show that DenseAV can discover the “meaning” of words and the “location” of sounds without explicit localization...


Bibliographic details
Main authors: Hamilton, M, Zisserman, A, Hershey, JR, Freeman, WT
Format: Conference item
Language: English
Published: IEEE 2024

Similar items