Self-Supervised Sound Promotion Method of Sound Localization from Video

Compared to traditional unimodal methods, multimodal audio-visual correspondence learning has many advantages in the field of video understanding, but it also faces significant challenges. In order to fully utilize the feature information from both modalities, we need to ensure accurate alignment o...

Bibliographic Details
Main Authors:	Yang Li, Xiaoli Zhao, Zhuoyao Zhang
Format:	Article
Language:	English
Published:	MDPI AG 2023-08-01
Series:	Electronics
Subjects:	audiovisual learning; self-supervised; sound localization; multi-model
Online Access:	https://www.mdpi.com/2079-9292/12/17/3558

_version_	1797582659359080448
author	Yang Li Xiaoli Zhao Zhuoyao Zhang
collection	DOAJ
description	Compared to traditional unimodal methods, multimodal audio-visual correspondence learning has many advantages in the field of video understanding, but it also faces significant challenges. To fully utilize the feature information from both modalities, we need to ensure accurate alignment of the semantic information from each modality rather than simply concatenating the features. This requires considering how to design fusion networks that perform this alignment better. Current algorithms rely heavily on the network’s output for sound-object localization while neglecting the possibility that feature information is suppressed by the network’s internal structure. We therefore propose a sound promotion method (SPM), a self-supervised framework that aims to increase the contribution of the audio modality and thereby improve audiovisual learning. We first cluster the audio separately to generate pseudo-labels and then use the clusters to train the audio backbone. Finally, we evaluate the impact of our method on several existing approaches using the MUSIC datasets; the results show that the proposed method improves their performance.
first_indexed	2024-03-10T23:25:37Z
format	Article
id	doaj.art-4b724e6c7b2c48af87eae28f3249cace
institution	Directory Open Access Journal
issn	2079-9292
language	English
last_indexed	2024-03-10T23:25:37Z
publishDate	2023-08-01
publisher	MDPI AG
record_format	Article
series	Electronics
spelling	doaj.art-4b724e6c7b2c48af87eae28f3249cace; 2023-11-19T08:00:53Z; eng; MDPI AG; Electronics; 2079-9292; 2023-08-01; 12; 17; 3558; 10.3390/electronics12173558; Self-Supervised Sound Promotion Method of Sound Localization from Video; Yang Li, Xiaoli Zhao, Zhuoyao Zhang (all: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, 333 Longteng Road, Shanghai 201620, China); https://www.mdpi.com/2079-9292/12/17/3558; audiovisual learning; self-supervised; sound localization; multi-model
title	Self-Supervised Sound Promotion Method of Sound Localization from Video
topic	audiovisual learning; self-supervised; sound localization; multi-model
url	https://www.mdpi.com/2079-9292/12/17/3558

Similar Items