Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation

Deep clustering is a state-of-the-art deep learning-based method for multi-talker, speaker-independent speech separation. It solves the label ambiguity problem by mapping time-frequency (TF) bins of the mixed spectrogram to an embedding space and assigning contrastive embedding vectors to...

Full description

Bibliographic Details
Main Authors: Ziqiang Shi, Huibin Lin, Liu Liu, Rujie Liu, Shoji Hayakawa, Jiqing Han
Format: Article
Language: English
Published: FRUCT 2018-11-01
Series: Proceedings of the XXth Conference of Open Innovations Association FRUCT
Subjects:
Online Access: https://fruct.org/publications/abstract23/files/Shi.pdf
_version_ 1811270107909849088
author Ziqiang Shi
Huibin Lin
Liu Liu
Rujie Liu
Shoji Hayakawa
Jiqing Han
author_facet Ziqiang Shi
Huibin Lin
Liu Liu
Rujie Liu
Shoji Hayakawa
Jiqing Han
author_sort Ziqiang Shi
collection DOAJ
description Deep clustering is a state-of-the-art deep learning-based method for multi-talker, speaker-independent speech separation. It solves the label ambiguity problem by mapping time-frequency (TF) bins of the mixed spectrogram to an embedding space and assigning contrastive embedding vectors to different TF regions in order to predict the mask of each speaker's target spectrogram. The original deep clustering transforms speech into the TF domain through the short-time Fourier transform (STFT). However, the frequency bins of the STFT are linearly spaced, whereas the frequency resolution of the human auditory system is nonlinear. We therefore propose to use the constant Q transform (CQT) instead of the STFT to better model the frequency resolving power of the human auditory system. The ideal upper bound of the signal-to-distortion ratio (SDR) of CQT-based deep clustering is higher than that of the STFT-based approach. Under the same experimental setting on the WSJ0-mix2 corpus, we give a detailed description of how to select the meta-parameters of the CQT for speech separation; the resulting method achieves an SDR improvement about 1 dB higher than that of the original deep clustering.
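The abstract's key contrast is between the linearly spaced frequency bins of the STFT and the geometrically spaced bins of the CQT, which better match the roughly logarithmic resolution of human hearing. A minimal numpy sketch of that difference is below; the sample rate, FFT size, and CQT parameters (fmin, bins per octave) are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Assumed parameters for illustration only (not taken from the paper):
# 8 kHz sample rate, 256-point FFT; CQT from C1 (~32.70 Hz), 24 bins/octave.
sr = 8000
n_fft = 256

# STFT: bin center frequencies are linearly spaced (constant bandwidth).
stft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)

# CQT: bin center frequencies are geometrically spaced, so the ratio of
# center frequency to bandwidth (the "Q") is constant across bins.
fmin, bins_per_octave, n_bins = 32.70, 24, 7 * 24
cqt_freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)

print(np.diff(stft_freqs)[:3])            # constant step in Hz
print(cqt_freqs[1:4] / cqt_freqs[:3])     # constant ratio between bins
```

The constant step for the STFT is sr / n_fft, while each CQT bin sits a fixed ratio 2^(1/bins_per_octave) above the previous one, packing many narrow bins into the low frequencies where the ear resolves pitch finely.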
first_indexed 2024-04-12T21:55:21Z
format Article
id doaj.art-550e272c4c99475098ff07d37c51856f
institution Directory Open Access Journal
issn 2305-7254
2343-0737
language English
last_indexed 2024-04-12T21:55:21Z
publishDate 2018-11-01
publisher FRUCT
record_format Article
series Proceedings of the XXth Conference of Open Innovations Association FRUCT
spelling doaj.art-550e272c4c99475098ff07d37c51856f2022-12-22T03:15:20ZengFRUCTProceedings of the XXth Conference of Open Innovations Association FRUCT2305-72542343-07372018-11-0160223538542Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech SeparationZiqiang Shi0Huibin Lin1Liu Liu2Rujie Liu3Shoji Hayakawa4Jiqing Han5Fujitsu Research and Development Center Beijing, ChinaFujitsu Research and Development Center Beijing, ChinaFujitsu Research and Development Center Beijing, ChinaFujitsu Research and Development Center Beijing, ChinaFujitsu Laboratories Ltd., Kawasaki, JapanHarbin Institute of Technology Harbin, ChinaDeep clustering is a state-of-the-art deep learning-based method for multi-talker, speaker-independent speech separation. It solves the label ambiguity problem by mapping time-frequency (TF) bins of the mixed spectrogram to an embedding space and assigning contrastive embedding vectors to different TF regions in order to predict the mask of each speaker's target spectrogram. The original deep clustering transforms speech into the TF domain through the short-time Fourier transform (STFT). However, the frequency bins of the STFT are linearly spaced, whereas the frequency resolution of the human auditory system is nonlinear. We therefore propose to use the constant Q transform (CQT) instead of the STFT to better model the frequency resolving power of the human auditory system. The ideal upper bound of the signal-to-distortion ratio (SDR) of CQT-based deep clustering is higher than that of the STFT-based approach. Under the same experimental setting on the WSJ0-mix2 corpus, we give a detailed description of how to select the meta-parameters of the CQT for speech separation; the resulting method achieves an SDR improvement about 1 dB higher than that of the original deep clustering.https://fruct.org/publications/abstract23/files/Shi.pdf Speech separationdeep learningconstant q transformembeddingclustering
spellingShingle Ziqiang Shi
Huibin Lin
Liu Liu
Rujie Liu
Shoji Hayakawa
Jiqing Han
Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation
Proceedings of the XXth Conference of Open Innovations Association FRUCT
Speech separation
deep learning
constant q transform
embedding
clustering
title Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation
title_full Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation
title_fullStr Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation
title_full_unstemmed Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation
title_short Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation
title_sort deep clustering with constant q transform for multi talker single channel speech separation
topic Speech separation
deep learning
constant q transform
embedding
clustering
url https://fruct.org/publications/abstract23/files/Shi.pdf
work_keys_str_mv AT ziqiangshi deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation
AT huibinlin deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation
AT liuliu deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation
AT rujieliu deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation
AT shojihayakawa deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation
AT jiqinghan deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation