Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation
Deep clustering is a state-of-the-art deep-learning-based method for multi-talker, speaker-independent speech separation. It solves the label ambiguity problem by mapping time-frequency (TF) bins of the mixed spectrogram to an embedding space and assigning contrastive embedding vectors to...
| Main Authors: | Ziqiang Shi, Huibin Lin, Liu Liu, Rujie Liu, Shoji Hayakawa, Jiqing Han |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | FRUCT, 2018-11-01 |
| Series: | Proceedings of the XXth Conference of Open Innovations Association FRUCT |
| Subjects: | Speech separation; deep learning; constant q transform; embedding; clustering |
| Online Access: | https://fruct.org/publications/abstract23/files/Shi.pdf |
_version_ | 1811270107909849088 |
author | Ziqiang Shi; Huibin Lin; Liu Liu; Rujie Liu; Shoji Hayakawa; Jiqing Han |
author_facet | Ziqiang Shi; Huibin Lin; Liu Liu; Rujie Liu; Shoji Hayakawa; Jiqing Han |
author_sort | Ziqiang Shi |
collection | DOAJ |
description | Deep clustering is a state-of-the-art deep-learning-based method for multi-talker, speaker-independent speech separation. It solves the label ambiguity problem by mapping time-frequency (TF) bins of the mixed spectrogram to an embedding space and assigning contrastive embedding vectors to different TF regions in order to predict the mask of the target spectrogram of each speaker. The original deep clustering transforms the speech into the TF domain through a short-time Fourier transform (STFT). However, the frequency resolution of the STFT is linear, while the frequency resolution of the human auditory system is nonlinear. We therefore propose to use the constant Q transform (CQT) instead of the STFT to better approximate the frequency resolving power of the human auditory system. The ideal upper bound of the signal-to-distortion ratio (SDR) of CQT-based deep clustering is higher than that of the STFT-based method. In the same experimental setting on the WSJ0-mix2 corpus, we give a detailed description of how to select the meta-parameters of the CQT for speech separation; the resulting SDR improvement is about 1 dB better than that of the original deep clustering. |
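The abstract's central contrast — the STFT's linearly spaced frequency bins versus the CQT's geometrically spaced bins with a constant ratio of center frequency to bandwidth (the "Q") — can be illustrated numerically. The sketch below uses only NumPy; the sample rate, `fmin`, and bins-per-octave values are illustrative assumptions, not the meta-parameters selected in the paper.

```python
import numpy as np

# Linear STFT bin centers: k * sr / n_fft (sr and n_fft are assumed values).
sr, n_fft = 8000, 256
stft_freqs = np.arange(n_fft // 2 + 1) * sr / n_fft

# Constant-Q bin centers: geometric spacing f_k = fmin * 2**(k / B).
fmin, bins_per_octave, n_bins = 32.7, 24, 160  # illustrative choices
k = np.arange(n_bins)
cqt_freqs = fmin * 2.0 ** (k / bins_per_octave)

# STFT: constant absolute bandwidth, so the bin spacing is the same everywhere.
stft_spacing = np.diff(stft_freqs)
assert np.allclose(stft_spacing, stft_spacing[0])

# CQT: the ratio of center frequency to bandwidth (the Q) is the same for
# every bin, so low frequencies get fine resolution and high ones get coarse,
# roughly matching the nonlinear resolution of human hearing.
Q = cqt_freqs[:-1] / np.diff(cqt_freqs)
assert np.allclose(Q, Q[0])

print(f"STFT bin spacing: {stft_spacing[0]:.2f} Hz for every bin")
print(f"CQT quality factor Q = {Q[0]:.2f} for every bin")
```

With these assumed settings the STFT devotes the same 31.25 Hz to every bin, while the CQT packs 24 bins into each octave, which is why it can resolve low-frequency speech harmonics more finely at the same total bin budget.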
first_indexed | 2024-04-12T21:55:21Z |
format | Article |
id | doaj.art-550e272c4c99475098ff07d37c51856f |
institution | Directory Open Access Journal |
issn | 2305-7254 2343-0737 |
language | English |
last_indexed | 2024-04-12T21:55:21Z |
publishDate | 2018-11-01 |
publisher | FRUCT |
record_format | Article |
series | Proceedings of the XXth Conference of Open Innovations Association FRUCT |
spelling | doaj.art-550e272c4c99475098ff07d37c51856f | 2022-12-22T03:15:20Z | eng | FRUCT | Proceedings of the XXth Conference of Open Innovations Association FRUCT | 2305-7254; 2343-0737 | 2018-11-01 | 60223538542 | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation | Ziqiang Shi (Fujitsu Research and Development Center, Beijing, China); Huibin Lin (Fujitsu Research and Development Center, Beijing, China); Liu Liu (Fujitsu Research and Development Center, Beijing, China); Rujie Liu (Fujitsu Research and Development Center, Beijing, China); Shoji Hayakawa (Fujitsu Laboratories Ltd., Kawasaki, Japan); Jiqing Han (Harbin Institute of Technology, Harbin, China) | Deep clustering is a state-of-the-art deep-learning-based method for multi-talker, speaker-independent speech separation. It solves the label ambiguity problem by mapping time-frequency (TF) bins of the mixed spectrogram to an embedding space and assigning contrastive embedding vectors to different TF regions in order to predict the mask of the target spectrogram of each speaker. The original deep clustering transforms the speech into the TF domain through a short-time Fourier transform (STFT). However, the frequency resolution of the STFT is linear, while the frequency resolution of the human auditory system is nonlinear. We therefore propose to use the constant Q transform (CQT) instead of the STFT to better approximate the frequency resolving power of the human auditory system. The ideal upper bound of the signal-to-distortion ratio (SDR) of CQT-based deep clustering is higher than that of the STFT-based method. In the same experimental setting on the WSJ0-mix2 corpus, we give a detailed description of how to select the meta-parameters of the CQT for speech separation; the resulting SDR improvement is about 1 dB better than that of the original deep clustering. | https://fruct.org/publications/abstract23/files/Shi.pdf | Speech separation; deep learning; constant q transform; embedding; clustering |
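The separation step restated in the record above — clustering the TF-bin embeddings and turning the cluster assignments into per-speaker binary masks — can be sketched as follows. This is a minimal illustration with synthetic embeddings and a hand-rolled K-means, not the paper's trained network; all shapes, names, and the noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: T frames, F frequency bins, D-dimensional embeddings.
T, F, D, n_speakers = 50, 64, 20, 2

# Stand-in for the network output: one unit-norm embedding per TF bin,
# generated around two synthetic "speaker" centroids.
centroids = rng.normal(size=(n_speakers, D))
labels_true = rng.integers(0, n_speakers, size=T * F)
V = centroids[labels_true] + 0.1 * rng.normal(size=(T * F, D))
V /= np.linalg.norm(V, axis=1, keepdims=True)

def kmeans(X, k, iters=20):
    """Plain Lloyd's algorithm: assign to nearest mean, recompute means."""
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - means[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                means[j] = X[assign == j].mean(axis=0)
    return assign

assign = kmeans(V, n_speakers)

# Binary masks: one (T, F) mask per speaker, partitioning the TF plane.
masks = np.stack([(assign == j).reshape(T, F) for j in range(n_speakers)])
assert (masks.sum(axis=0) == 1).all()  # each TF bin goes to exactly one speaker
```

Each mask would then multiply the mixture's TF magnitude (STFT or CQT) before inversion to time domain, which is where the transform's invertibility and resolution — the paper's motivation for the CQT — matter.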
spellingShingle | Ziqiang Shi; Huibin Lin; Liu Liu; Rujie Liu; Shoji Hayakawa; Jiqing Han | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation | Proceedings of the XXth Conference of Open Innovations Association FRUCT | Speech separation; deep learning; constant q transform; embedding; clustering |
title | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation |
title_full | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation |
title_fullStr | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation |
title_full_unstemmed | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation |
title_short | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation |
title_sort | deep clustering with constant q transform for multi talker single channel speech separation |
topic | Speech separation; deep learning; constant q transform; embedding; clustering |
url | https://fruct.org/publications/abstract23/files/Shi.pdf |
work_keys_str_mv | AT ziqiangshi deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation AT huibinlin deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation AT liuliu deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation AT rujieliu deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation AT shojihayakawa deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation AT jiqinghan deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation |