Separating the “chirp” from the “chat”: self-supervised visual grounding of sound and language

We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visual aligned features solely through watching videos. We show that DenseAV can discover the “meaning” of words and the “location” of sounds without explicit localization...

Bibliographic details
Main authors:	Hamilton, M, Zisserman, A, Hershey, JR, Freeman, WT
Format:	Conference item
Language:	English
Published:	IEEE 2024

Similar items

Multi-task self-supervised visual learning
by: Doersch, C, et al.
Published: (2017)

Ambient Sound Provides Supervision for Visual Learning
by: Owens, Andrew Hale, et al.
Published: (2017)

Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning
by: Owens, Andrew, et al.
Published: (2021)

Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds
by: Tomoya Sato, et al.
Published: (2022-01-01)

Self-supervised learning of audio-visual objects from video
by: Afouras, T, et al.
Published: (2020)

Enhancement of sound by soft reflections in exponentially chirped crystals
by: A. Cebrecos, et al.
Published: (2014-12-01)

Music Gesture for Visual Sound Separation
by: Gan, Chuang, et al.
Published: (2021)

First observations of oblique ionospheric sounding chirp signal in Mexico
by: M.A. Sergeeva, et al.
Published: (2019-03-01)

Features of backscatter ionospheric sounding as studied with a chirp ionosonde
by: Ponomarchuk S.N., et al.
Published: (2017-09-01)

Audio-Visual Self-Supervised Terrain Type Recognition for Ground Mobile Platforms
by: Akiyoshi Kurobe, et al.
Published: (2021-01-01)

Self-supervised learning for spinal MRIs
by: Jamaludin, A, et al.
Published: (2017)

Self-Supervised Moving Vehicle Tracking With Stereo Sound
by: Gan, Chuang, et al.
Published: (2021)

Unsupervised discovery of visual object class hierarchies
by: Sivic, J, et al.
Published: (2008)

Localizing visual sounds the hard way
by: Vedaldi, A, et al.
Published: (2021)

Weakly supervised scale-invariant learning of models for visual recognition
by: Fergus, R, et al.
Published: (2006)

Enhancing bowel sound recognition with self-attention and self-supervised pre-training.
by: Yansuo Yu, et al.
Published: (2024-01-01)

Self-supervised co-training for video representation learning
by: Han, T, et al.
Published: (2020)

ESTIMATING ANTENNA COUPLING FACTOR FOR PROBLEM OF TOPSIDE IONOSPHERE SOUNDING FROM SPACE BY CHIRP SIGNALS
by: Podlesnyi A.V., et al.
Published: (2019-12-01)

Self-supervised learning of class embeddings from video
by: Wiles, O, et al.
Published: (2020)

Sight to Sound: An End-to-End Approach for Visual Piano Transcription
by: Koepke, S, et al.
Published: (2020)

ASDNet: An Efficient Self-Supervised Convolutional Network for Anomalous Sound Detection
by: Dewei Kong, et al.
Published: (2025-01-01)

Self-Supervised Transfer Learning from Natural Images for Sound Classification
by: Sungho Shin, et al.
Published: (2021-03-01)

Visually Indicated Sounds
by: Isola, Phillip, et al.
Published: (2017)

Direct Underwater Sound Velocity Measurement Based on the Acousto-Optic Self-Interference Effect between the Chirp Signal and the Optical Frequency Comb
by: Zihui Yang, et al.
Published: (2022-12-01)

Extraction of Individual EEG Gamma Frequencies from the Responses to Click-Based Chirp-Modulated Sounds
by: Aurimas Mockevičius, et al.
Published: (2023-03-01)

Chat2VIS: Generating Data Visualizations via Natural Language Using ChatGPT, Codex and GPT-3 Large Language Models
by: Paula Maddigan, et al.
Published: (2023-01-01)

Disentangled Speech Embeddings Using Cross-Modal Self-Supervision
by: Nagrani, A, et al.
Published: (2020)

Self-Supervised Audio-Visual Co-Segmentation
by: Rouditchenko, Andrew, et al.
Published: (2022)

Self-Supervised Audio-Visual Co-Segmentation
by: Rouditchenko, Andrew, et al.
Published: (2021)

Self-Supervised Autoencoders for Visual Anomaly Detection
by: Alexander Bauer, et al.
Published: (2024-12-01)

Combining Unsupervised and Supervised Learning for Sample Efficient Continuous Language Grounding
by: Oliver Roesler
Published: (2022-09-01)

Weakly-supervised fingerspelling recognition in British Sign Language videos
by: Prajwal, KR, et al.
Published: (2022)

Self-supervised learning of a facial attribute embedding from video
by: Wiles, O, et al.
Published: (2018)

Self-supervised video object segmentation by motion grouping
by: Yang, C, et al.
Published: (2021)

A Climate Hyperspectral Infrared Radiance Product (CHIRP) Combining the AIRS and CrIS Satellite Sounding Record
by: L. Larrabee Strow, et al.
Published: (2021-01-01)

Diagnostics of HF radio channel: based on data from backscatter ionospheric sounding by continuous chirp signal
by: Ponomarchuk S.N., et al.
Published: (2018-06-01)

Application of Optimized Adaptive Chirp Mode Decomposition Method in Chirp Signal
by: Junyuan Wang, et al.
Published: (2020-05-01)

Self-supervised multi-modal alignment for whole body medical imaging
by: Windsor, R, et al.
Published: (2021)

Now you're speaking my language: visual language identification
by: Afouras, T, et al.
Published: (2020)

Exploring the Utility of ChatGPT for Self-directed Online Language Learning
by: Zixi Li, et al.
Published: (2024-09-01)