Separating the “chirp” from the “chat”: self-supervised visual grounding of sound and language

We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visual aligned features solely through watching videos. We show that DenseAV can discover the “meaning” of words and the “location” of sounds without explicit localization...

Bibliographic Details
Main Authors:	Hamilton, M, Zisserman, A, Hershey, JR, Freeman, WT
Format:	Conference item
Language:	English
Published:	IEEE 2024

Similar Items

Multi-task self-supervised visual learning
by: Doersch, C, et al.
Published: (2017)

Ambient Sound Provides Supervision for Visual Learning
by: Owens, Andrew Hale, et al.
Published: (2017)

Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning
by: Owens, Andrew, et al.
Published: (2021)

Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds
by: Tomoya Sato, et al.
Published: (2022-01-01)

Self-supervised learning of audio-visual objects from video
by: Afouras, T, et al.
Published: (2020)

Enhancement of sound by soft reflections in exponentially chirped crystals
by: A. Cebrecos, et al.
Published: (2014-12-01)

Music Gesture for Visual Sound Separation
by: Gan, Chuang, et al.
Published: (2021)

First observations of oblique ionospheric sounding chirp signal in Mexico
by: M.A. Sergeeva, et al.
Published: (2019-03-01)

Features of backscatter ionospheric sounding as studied with a chirp ionosonde
by: Ponomarchuk S.N., et al.
Published: (2017-09-01)

Audio-Visual Self-Supervised Terrain Type Recognition for Ground Mobile Platforms
by: Akiyoshi Kurobe, et al.
Published: (2021-01-01)

Self-supervised learning for spinal MRIs
by: Jamaludin, A, et al.
Published: (2017)

Self-Supervised Moving Vehicle Tracking With Stereo Sound
by: Gan, Chuang, et al.
Published: (2021)

Unsupervised discovery of visual object class hierarchies
by: Sivic, J, et al.
Published: (2008)

Localizing visual sounds the hard way
by: Vedaldi, A, et al.
Published: (2021)

Weakly supervised scale-invariant learning of models for visual recognition
by: Fergus, R, et al.
Published: (2006)

Enhancing bowel sound recognition with self-attention and self-supervised pre-training.
by: Yansuo Yu, et al.
Published: (2024-01-01)

Self-supervised co-training for video representation learning
by: Han, T, et al.
Published: (2020)

ESTIMATING ANTENNA COUPLING FACTOR FOR PROBLEM OF TOPSIDE IONOSPHERE SOUNDING FROM SPACE BY CHIRP SIGNALS
by: Podlesnyi A.V., et al.
Published: (2019-12-01)

Self-supervised learning of class embeddings from video
by: Wiles, O, et al.
Published: (2020)

Sight to Sound: An End-to-End Approach for Visual Piano Transcription
by: Koepke, S, et al.
Published: (2020)

ASDNet: An Efficient Self-Supervised Convolutional Network for Anomalous Sound Detection
by: Dewei Kong, et al.
Published: (2025-01-01)

Self-Supervised Transfer Learning from Natural Images for Sound Classification
by: Sungho Shin, et al.
Published: (2021-03-01)

Visually Indicated Sounds
by: Isola, Phillip, et al.
Published: (2017)

Direct Underwater Sound Velocity Measurement Based on the Acousto-Optic Self-Interference Effect between the Chirp Signal and the Optical Frequency Comb
by: Zihui Yang, et al.
Published: (2022-12-01)

Extraction of Individual EEG Gamma Frequencies from the Responses to Click-Based Chirp-Modulated Sounds
by: Aurimas Mockevičius, et al.
Published: (2023-03-01)

Chat2VIS: Generating Data Visualizations via Natural Language Using ChatGPT, Codex and GPT-3 Large Language Models
by: Paula Maddigan, et al.
Published: (2023-01-01)

Disentangled Speech Embeddings Using Cross-Modal Self-Supervision
by: Nagrani, A, et al.
Published: (2020)

Self-Supervised Audio-Visual Co-Segmentation
by: Rouditchenko, Andrew, et al.
Published: (2022)

Self-Supervised Audio-Visual Co-Segmentation
by: Rouditchenko, Andrew, et al.
Published: (2021)

Self-Supervised Autoencoders for Visual Anomaly Detection
by: Alexander Bauer, et al.
Published: (2024-12-01)

Combining Unsupervised and Supervised Learning for Sample Efficient Continuous Language Grounding
by: Oliver Roesler
Published: (2022-09-01)

Weakly-supervised fingerspelling recognition in British Sign Language videos
by: Prajwal, KR, et al.
Published: (2022)

Self-supervised learning of a facial attribute embedding from video
by: Wiles, O, et al.
Published: (2018)

Self-supervised video object segmentation by motion grouping
by: Yang, C, et al.
Published: (2021)

A Climate Hyperspectral Infrared Radiance Product (CHIRP) Combining the AIRS and CrIS Satellite Sounding Record
by: L. Larrabee Strow, et al.
Published: (2021-01-01)

Diagnostics of HF radio channel: based on data from backscatter ionospheric sounding by continuous chirp signal
by: Ponomarchuk S.N., et al.
Published: (2018-06-01)

Application of Optimized Adaptive Chirp Mode Decomposition Method in Chirp Signal
by: Junyuan Wang, et al.
Published: (2020-05-01)

Self-supervised multi-modal alignment for whole body medical imaging
by: Windsor, R, et al.
Published: (2021)

Now you're speaking my language: visual language identification
by: Afouras, T, et al.
Published: (2020)

Exploring the Utility of ChatGPT for Self-directed Online Language Learning
by: Zixi Li, et al.
Published: (2024-09-01)