Separating the “chirp” from the “chat”: self-supervised visual grounding of sound and language

We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visual aligned features solely through watching videos. We show that DenseAV can discover the “meaning” of words and the “location” of sounds without explicit localization...

Bibliographic Details
Main authors:	Hamilton, M, Zisserman, A, Hershey, JR, Freeman, WT
Format:	Conference item
Language:	English
Published:	IEEE 2024

Similar Items

Multi-task self-supervised visual learning
by: Doersch, C, et al.
Published: (2017)

Ambient Sound Provides Supervision for Visual Learning
by: Owens, Andrew Hale, et al.
Published: (2017)

Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning
by: Owens, Andrew, et al.
Published: (2021)

Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds
by: Tomoya Sato, et al.
Published: (2022-01-01)

Self-supervised learning of audio-visual objects from video
by: Afouras, T, et al.
Published: (2020)

Enhancement of sound by soft reflections in exponentially chirped crystals
by: A. Cebrecos, et al.
Published: (2014-12-01)

Music Gesture for Visual Sound Separation
by: Gan, Chuang, et al.
Published: (2021)

First observations of oblique ionospheric sounding chirp signal in Mexico
by: M.A. Sergeeva, et al.
Published: (2019-03-01)

Features of backscatter ionospheric sounding as studied with a chirp ionosonde
by: Ponomarchuk S.N., et al.
Published: (2017-09-01)

Audio-Visual Self-Supervised Terrain Type Recognition for Ground Mobile Platforms
by: Akiyoshi Kurobe, et al.
Published: (2021-01-01)

Self-supervised learning for spinal MRIs
by: Jamaludin, A, et al.
Published: (2017)

Self-Supervised Moving Vehicle Tracking With Stereo Sound
by: Gan, Chuang, et al.
Published: (2021)

Unsupervised discovery of visual object class hierarchies
by: Sivic, J, et al.
Published: (2008)

Localizing visual sounds the hard way
by: Vedaldi, A, et al.
Published: (2021)

Weakly supervised scale-invariant learning of models for visual recognition
by: Fergus, R, et al.
Published: (2006)

Enhancing bowel sound recognition with self-attention and self-supervised pre-training
by: Yansuo Yu, et al.
Published: (2024-01-01)

Self-supervised co-training for video representation learning
by: Han, T, et al.
Published: (2020)

ESTIMATING ANTENNA COUPLING FACTOR FOR PROBLEM OF TOPSIDE IONOSPHERE SOUNDING FROM SPACE BY CHIRP SIGNALS
by: Podlesnyi A.V., et al.
Published: (2019-12-01)

Self-supervised learning of class embeddings from video
by: Wiles, O, et al.
Published: (2020)

Sight to Sound: An End-to-End Approach for Visual Piano Transcription
by: Koepke, S, et al.
Published: (2020)

ASDNet: An Efficient Self-Supervised Convolutional Network for Anomalous Sound Detection
by: Dewei Kong, et al.
Published: (2025-01-01)

Self-Supervised Transfer Learning from Natural Images for Sound Classification
by: Sungho Shin, et al.
Published: (2021-03-01)

Visually Indicated Sounds
by: Isola, Phillip, et al.
Published: (2017)

Direct Underwater Sound Velocity Measurement Based on the Acousto-Optic Self-Interference Effect between the Chirp Signal and the Optical Frequency Comb
by: Zihui Yang, et al.
Published: (2022-12-01)

Extraction of Individual EEG Gamma Frequencies from the Responses to Click-Based Chirp-Modulated Sounds
by: Aurimas Mockevičius, et al.
Published: (2023-03-01)

Chat2VIS: Generating Data Visualizations via Natural Language Using ChatGPT, Codex and GPT-3 Large Language Models
by: Paula Maddigan, et al.
Published: (2023-01-01)

Disentangled Speech Embeddings Using Cross-Modal Self-Supervision
by: Nagrani, A, et al.
Published: (2020)

Self-Supervised Audio-Visual Co-Segmentation
by: Rouditchenko, Andrew, et al.
Published: (2022)

Self-Supervised Audio-Visual Co-Segmentation
by: Rouditchenko, Andrew, et al.
Published: (2021)

Self-Supervised Autoencoders for Visual Anomaly Detection
by: Alexander Bauer, et al.
Published: (2024-12-01)

Combining Unsupervised and Supervised Learning for Sample Efficient Continuous Language Grounding
by: Oliver Roesler
Published: (2022-09-01)

Weakly-supervised fingerspelling recognition in British Sign Language videos
by: Prajwal, KR, et al.
Published: (2022)

Self-supervised learning of a facial attribute embedding from video
by: Wiles, O, et al.
Published: (2018)

Self-supervised video object segmentation by motion grouping
by: Yang, C, et al.
Published: (2021)

A Climate Hyperspectral Infrared Radiance Product (CHIRP) Combining the AIRS and CrIS Satellite Sounding Record
by: L. Larrabee Strow, et al.
Published: (2021-01-01)

Diagnostics of HF radio channel: based on data from backscatter ionospheric sounding by continuous chirp signal
by: Ponomarchuk S.N., et al.
Published: (2018-06-01)

Application of Optimized Adaptive Chirp Mode Decomposition Method in Chirp Signal
by: Junyuan Wang, et al.
Published: (2020-05-01)

Self-supervised multi-modal alignment for whole body medical imaging
by: Windsor, R, et al.
Published: (2021)

Now you're speaking my language: visual language identification
by: Afouras, T, et al.
Published: (2020)

Exploring the Utility of ChatGPT for Self-directed Online Language Learning
by: Zixi Li, et al.
Published: (2024-09-01)