3 directional Inception-ResUNet: Deep spatial feature learning for multichannel singing voice separation with distortion.

Singing voice separation on robots faces the problem of interpreting ambiguous auditory signals. The acoustic signal, which the humanoid robot perceives through its onboard microphones, is a mixture of singing voice, music, and noise, with distortion, attenuation, and reverberation. In this paper, w...


Bibliographic Details
Main Authors: DaDong Wang, Jie Wang, MingChen Sun
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2024-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0289453
Description: Singing voice separation on robots faces the problem of interpreting ambiguous auditory signals. The acoustic signal, which the humanoid robot perceives through its onboard microphones, is a mixture of singing voice, music, and noise, with distortion, attenuation, and reverberation. In this paper, we used the 3D Inception-ResUNet structure in the U-shaped encoding and decoding network to improve the utilization of the spatial and spectral information of the spectrogram. Multiobjectives were used to train the model: magnitude consistency loss, phase consistency loss, and magnitude correlation consistency loss. We recorded the singing voice and accompaniment derived from the MIR-1K dataset with NAO robots and synthesized the 10-channel dataset for training the model. The experimental results show that the proposed model trained by multiple objectives reaches an average NSDR of 11.55 dB on the test dataset, which outperforms the comparison model.
ISSN: 1932-6203
Citation: PLoS ONE 19(1): e0289453 (2024)
Collection: DOAJ (Directory of Open Access Journals)
Record ID: doaj.art-0b796784f8df46eea81cde8f09ac923c
First indexed: 2024-03-08T06:20:01Z