Separating the “chirp” from the “chat”: self-supervised visual grounding of sound and language

We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the “meaning” of words and the “location” of sounds without explicit localization...

Bibliographic details
Main Authors:	Hamilton, M, Zisserman, A, Hershey, JR, Freeman, WT
Format:	Conference item
Language:	English
Published:	IEEE 2024

Related records

Multi-task self-supervised visual learning
By: Doersch, C, et al.
Published: (2017)

Ambient Sound Provides Supervision for Visual Learning
By: Owens, Andrew Hale, et al.
Published: (2017)

Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning
By: Owens, Andrew, et al.
Published: (2021)

Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds
By: Tomoya Sato, et al.
Published: (2022-01-01)

Self-supervised learning of audio-visual objects from video
By: Afouras, T, et al.
Published: (2020)

Enhancement of sound by soft reflections in exponentially chirped crystals
By: A. Cebrecos, et al.
Published: (2014-12-01)

Music Gesture for Visual Sound Separation
By: Gan, Chuang, et al.
Published: (2021)

First observations of oblique ionospheric sounding chirp signal in Mexico
By: M.A. Sergeeva, et al.
Published: (2019-03-01)

Features of backscatter ionospheric sounding as studied with a chirp ionosonde
By: Ponomarchuk S.N., et al.
Published: (2017-09-01)

Audio-Visual Self-Supervised Terrain Type Recognition for Ground Mobile Platforms
By: Akiyoshi Kurobe, et al.
Published: (2021-01-01)

Self-supervised learning for spinal MRIs
By: Jamaludin, A, et al.
Published: (2017)

Self-Supervised Moving Vehicle Tracking With Stereo Sound
By: Gan, Chuang, et al.
Published: (2021)

Unsupervised discovery of visual object class hierarchies
By: Sivic, J, et al.
Published: (2008)

Localizing visual sounds the hard way
By: Vedaldi, A, et al.
Published: (2021)

Weakly supervised scale-invariant learning of models for visual recognition
By: Fergus, R, et al.
Published: (2006)

Enhancing bowel sound recognition with self-attention and self-supervised pre-training.
By: Yansuo Yu, et al.
Published: (2024-01-01)

Self-supervised co-training for video representation learning
By: Han, T, et al.
Published: (2020)

ESTIMATING ANTENNA COUPLING FACTOR FOR PROBLEM OF TOPSIDE IONOSPHERE SOUNDING FROM SPACE BY CHIRP SIGNALS
By: Podlesnyi A.V., et al.
Published: (2019-12-01)

Self-supervised learning of class embeddings from video
By: Wiles, O, et al.
Published: (2020)

Sight to Sound: An End-to-End Approach for Visual Piano Transcription
By: Koepke, S, et al.
Published: (2020)

ASDNet: An Efficient Self-Supervised Convolutional Network for Anomalous Sound Detection
By: Dewei Kong, et al.
Published: (2025-01-01)

Self-Supervised Transfer Learning from Natural Images for Sound Classification
By: Sungho Shin, et al.
Published: (2021-03-01)

Visually Indicated Sounds
By: Isola, Phillip, et al.
Published: (2017)

Direct Underwater Sound Velocity Measurement Based on the Acousto-Optic Self-Interference Effect between the Chirp Signal and the Optical Frequency Comb
By: Zihui Yang, et al.
Published: (2022-12-01)

Extraction of Individual EEG Gamma Frequencies from the Responses to Click-Based Chirp-Modulated Sounds
By: Aurimas Mockevičius, et al.
Published: (2023-03-01)

Chat2VIS: Generating Data Visualizations via Natural Language Using ChatGPT, Codex and GPT-3 Large Language Models
By: Paula Maddigan, et al.
Published: (2023-01-01)

Disentangled Speech Embeddings Using Cross-Modal Self-Supervision
By: Nagrani, A, et al.
Published: (2020)

Self-Supervised Audio-Visual Co-Segmentation
By: Rouditchenko, Andrew, et al.
Published: (2022)

Self-Supervised Audio-Visual Co-Segmentation
By: Rouditchenko, Andrew, et al.
Published: (2021)

Self-Supervised Autoencoders for Visual Anomaly Detection
By: Alexander Bauer, et al.
Published: (2024-12-01)

Combining Unsupervised and Supervised Learning for Sample Efficient Continuous Language Grounding
By: Oliver Roesler
Published: (2022-09-01)

Weakly-supervised fingerspelling recognition in British Sign Language videos
By: Prajwal, KR, et al.
Published: (2022)

Self-supervised learning of a facial attribute embedding from video
By: Wiles, O, et al.
Published: (2018)

Self-supervised video object segmentation by motion grouping
By: Yang, C, et al.
Published: (2021)

A Climate Hyperspectral Infrared Radiance Product (CHIRP) Combining the AIRS and CrIS Satellite Sounding Record
By: L. Larrabee Strow, et al.
Published: (2021-01-01)

Diagnostics of HF radio channel: based on data from backscatter ionospheric sounding by continuous chirp signal
By: Ponomarchuk S.N., et al.
Published: (2018-06-01)

Application of Optimized Adaptive Chirp Mode Decomposition Method in Chirp Signal
By: Junyuan Wang, et al.
Published: (2020-05-01)

Self-supervised multi-modal alignment for whole body medical imaging
By: Windsor, R, et al.
Published: (2021)

Now you're speaking my language: visual language identification
By: Afouras, T, et al.
Published: (2020)

Exploring the Utility of ChatGPT for Self-directed Online Language Learning
By: Zixi Li, et al.
Published: (2024-09-01)