Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM

Utterance-level permutation invariant training (uPIT) is a state-of-the-art deep learning technique for speaker-independent multi-talker speech separation. uPIT solves the label ambiguity problem by minimizing the mean square error (MSE) over all permutations between outputs and targets. However, uPIT may be sub-optimal at the segmental level because the optimization is not performed over individual frames. In this paper, we propose a constrained uPIT (cuPIT) that solves this problem by computing a weighted MSE loss using dynamic information (i.e., delta and acceleration). The weighted loss ensures the temporal continuity of output frames belonging to the same speaker. Inspired by the vocal tract continuity heuristic in computational auditory scene analysis, we then extend the model by adding a Grid LSTM layer, which we name cuPIT-Grid LSTM, to learn temporal and spectral patterns over the input magnitude spectrum simultaneously. Experimental results show 9.6% and 8.5% relative improvements over the uPIT baseline on the WSJ0-2mix dataset under the closed and open conditions, respectively.
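As a rough illustration of the loss described in the abstract, the sketch below computes a permutation-invariant weighted MSE for two speakers in plain NumPy. The function names, the simple first-difference deltas, and the `w_delta`/`w_accel` weights are illustrative assumptions, not the paper's exact formulation.

```python
import itertools
import numpy as np

def delta(x):
    # First-order temporal difference along the frame axis; a simple
    # stand-in for the delta (dynamic) features used in the paper.
    return np.diff(x, axis=0, prepend=x[:1])

def cupit_loss(outputs, targets, w_delta=1.0, w_accel=1.0):
    """Permutation-invariant weighted MSE over all speaker assignments.

    outputs, targets: lists of [frames, freq_bins] magnitude spectra,
    one array per speaker. The delta and acceleration terms penalize
    outputs whose temporal trajectory drifts from the target's,
    encouraging frame-level continuity for the same speaker.
    w_delta and w_accel are illustrative hyperparameters.
    """
    n = len(outputs)
    best = np.inf
    for perm in itertools.permutations(range(n)):
        loss = 0.0
        for tgt_idx, out_idx in enumerate(perm):
            o, t = outputs[out_idx], targets[tgt_idx]
            loss += np.mean((o - t) ** 2)                          # static term
            loss += w_delta * np.mean((delta(o) - delta(t)) ** 2)  # delta term
            loss += w_accel * np.mean(
                (delta(delta(o)) - delta(delta(t))) ** 2)          # acceleration term
        best = min(best, loss / n)  # keep the best permutation for the utterance
    return best

# Toy usage: two random 100-frame, 129-bin "spectra" per side.
rng = np.random.default_rng(0)
outs = [rng.random((100, 129)) for _ in range(2)]
tgts = [rng.random((100, 129)) for _ in range(2)]
print(cupit_loss(outs, tgts))
```

Because the minimum is taken over whole-utterance losses rather than per frame, the assignment of output streams to speakers stays fixed across the utterance, which is the core idea of uPIT.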

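The Grid LSTM layer mentioned in the abstract scans the magnitude spectrum along both time and frequency. The module below is a simplified stand-in (a PyTorch sketch, assuming [batch, frames, freq_bins] input; the class and parameter names are hypothetical): it runs one bidirectional LSTM along frequency within each frame and another along time within each frequency bin, then concatenates the two views per time-frequency bin. A true Grid LSTM couples the two axes inside shared cells, so treat this only as an approximation of the idea.

```python
import torch
import torch.nn as nn

class TimeFreqScan(nn.Module):
    """Simplified stand-in for the paper's Grid LSTM layer.

    One LSTM scans each frame along the frequency axis and another
    scans each frequency bin along the time axis; their hidden states
    are concatenated per time-frequency bin.
    """
    def __init__(self, hidden=32):
        super().__init__()
        self.freq_lstm = nn.LSTM(1, hidden, batch_first=True, bidirectional=True)
        self.time_lstm = nn.LSTM(1, hidden, batch_first=True, bidirectional=True)

    def forward(self, spec):  # spec: [batch, T, F] magnitude spectrum
        b, t, f = spec.shape
        # Scan along frequency: each of the b*t frames is a length-F sequence.
        h_f, _ = self.freq_lstm(spec.reshape(b * t, f, 1))
        h_f = h_f.reshape(b, t, f, -1)
        # Scan along time: each of the b*f frequency bins is a length-T sequence.
        h_t, _ = self.time_lstm(spec.transpose(1, 2).reshape(b * f, t, 1))
        h_t = h_t.reshape(b, f, t, -1).transpose(1, 2)
        return torch.cat([h_f, h_t], dim=-1)  # [batch, T, F, 4*hidden]

# Toy usage on a random 2-utterance batch of 100-frame, 129-bin spectra.
x = torch.rand(2, 100, 129)
print(TimeFreqScan()(x).shape)  # torch.Size([2, 100, 129, 128])
```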

Bibliographic Details
Main Authors: Xu, Chenglin; Rao, Wei; Xiao, Xiong; Chng, Eng Siong; Li, Haizhou
Other Authors: School of Computer Science and Engineering
Format: Conference Paper
Language: English
Published: 2020 (paper first presented at ICASSP 2018)
Subjects: Engineering::Computer science and engineering; Constrained Permutation Invariant Training; Grid LSTM
Online Access:https://hdl.handle.net/10356/137336
Citation: Xu, C., Rao, W., Xiao, X., Chng, E. S., & Li, H. (2018). Single channel speech separation with constrained utterance level permutation invariant training using grid LSTM. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6-10. doi:10.1109/ICASSP.2018.8462471
Conference: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Institution: Nanyang Technological University
Research Units: School of Computer Science and Engineering; Temasek Laboratories
DOI: 10.1109/ICASSP.2018.8462471
ISBN: 9781538646588
Scopus ID: 2-s2.0-85054262852
Version: Accepted version
Rights: © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available at https://doi.org/10.1109/ICASSP.2018.8462471.