Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation
Deep clustering is a state-of-the-art deep-learning-based method for multi-talker, speaker-independent speech separation. It solves the label ambiguity problem by mapping time-frequency (TF) bins of the mixed spectrogram to an embedding space and assigning contrastive embedding vectors to...
| Main Authors: | Ziqiang Shi, Huibin Lin, Liu Liu, Rujie Liu, Shoji Hayakawa, Jiqing Han |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | FRUCT, 2018-11-01 |
| Series: | Proceedings of the XXth Conference of Open Innovations Association FRUCT |
| Subjects: | Speech separation; deep learning; constant q transform; embedding; clustering |
| Online Access: | https://fruct.org/publications/abstract23/files/Shi.pdf |
_version_ | 1811270107909849088 |
author | Ziqiang Shi; Huibin Lin; Liu Liu; Rujie Liu; Shoji Hayakawa; Jiqing Han |
author_facet | Ziqiang Shi; Huibin Lin; Liu Liu; Rujie Liu; Shoji Hayakawa; Jiqing Han |
author_sort | Ziqiang Shi |
collection | DOAJ |
description | Deep clustering is a state-of-the-art deep-learning-based method for multi-talker, speaker-independent speech separation. It solves the label ambiguity problem by mapping time-frequency (TF) bins of the mixed spectrogram to an embedding space and assigning contrastive embedding vectors to different TF regions in order to predict the mask of the target spectrogram of each speaker. The original deep clustering transforms the speech into the TF domain through a short-time Fourier transform (STFT). However, the frequency resolution of the STFT is linear, while the frequency resolution of the human auditory system is nonlinear. We therefore propose to use the constant Q transform (CQT) instead of the STFT to better approximate the frequency resolving power of the human auditory system. The ideal upper bound of the signal-to-distortion ratio (SDR) of CQT-based deep clustering is higher than that of the STFT-based method. In the same experimental setting on the WSJ0-mix2 corpus, we give a detailed description of how to select the meta-parameters of the CQT for speech separation; the resulting SDR improvement is about 1 dB better than that of the original deep clustering. |
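The abstract's central contrast — the STFT's linearly spaced frequency bins versus the CQT's geometrically spaced bins with a constant ratio of center frequency to bandwidth (the "Q") — can be illustrated numerically. The sketch below uses only NumPy; the sample rate, `fmin`, and bins-per-octave values are illustrative assumptions, not the meta-parameters selected in the paper.

```python
import numpy as np

# Linear STFT bin centers: k * sr / n_fft (sr and n_fft are assumed values).
sr, n_fft = 8000, 256
stft_freqs = np.arange(n_fft // 2 + 1) * sr / n_fft

# Constant-Q bin centers: geometric spacing f_k = fmin * 2**(k / B).
fmin, bins_per_octave, n_bins = 32.7, 24, 160  # illustrative choices
k = np.arange(n_bins)
cqt_freqs = fmin * 2.0 ** (k / bins_per_octave)

# STFT: constant absolute bandwidth, so the bin spacing is the same everywhere.
stft_spacing = np.diff(stft_freqs)
assert np.allclose(stft_spacing, stft_spacing[0])

# CQT: the ratio of center frequency to bandwidth (the Q) is the same for
# every bin, so low frequencies get fine resolution and high ones get coarse,
# roughly matching the nonlinear resolution of human hearing.
Q = cqt_freqs[:-1] / np.diff(cqt_freqs)
assert np.allclose(Q, Q[0])

print(f"STFT bin spacing: {stft_spacing[0]:.2f} Hz for every bin")
print(f"CQT quality factor Q = {Q[0]:.2f} for every bin")
```

With these assumed settings the STFT devotes the same 31.25 Hz to every bin, while the CQT packs 24 bins into each octave, which is why it can resolve low-frequency speech harmonics more finely at the same total bin budget.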
first_indexed | 2024-04-12T21:55:21Z |
format | Article |
id | doaj.art-550e272c4c99475098ff07d37c51856f |
institution | Directory Open Access Journal |
issn | 2305-7254 2343-0737 |
language | English |
last_indexed | 2024-04-12T21:55:21Z |
publishDate | 2018-11-01 |
publisher | FRUCT |
record_format | Article |
series | Proceedings of the XXth Conference of Open Innovations Association FRUCT |
spelling | doaj.art-550e272c4c99475098ff07d37c51856f | 2022-12-22T03:15:20Z | eng | FRUCT | Proceedings of the XXth Conference of Open Innovations Association FRUCT | 2305-7254; 2343-0737 | 2018-11-01 | 60223538542 | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation | Ziqiang Shi (Fujitsu Research and Development Center, Beijing, China); Huibin Lin (Fujitsu Research and Development Center, Beijing, China); Liu Liu (Fujitsu Research and Development Center, Beijing, China); Rujie Liu (Fujitsu Research and Development Center, Beijing, China); Shoji Hayakawa (Fujitsu Laboratories Ltd., Kawasaki, Japan); Jiqing Han (Harbin Institute of Technology, Harbin, China) | Deep clustering is a state-of-the-art deep-learning-based method for multi-talker, speaker-independent speech separation. It solves the label ambiguity problem by mapping time-frequency (TF) bins of the mixed spectrogram to an embedding space and assigning contrastive embedding vectors to different TF regions in order to predict the mask of the target spectrogram of each speaker. The original deep clustering transforms the speech into the TF domain through a short-time Fourier transform (STFT). However, the frequency resolution of the STFT is linear, while the frequency resolution of the human auditory system is nonlinear. We therefore propose to use the constant Q transform (CQT) instead of the STFT to better approximate the frequency resolving power of the human auditory system. The ideal upper bound of the signal-to-distortion ratio (SDR) of CQT-based deep clustering is higher than that of the STFT-based method. In the same experimental setting on the WSJ0-mix2 corpus, we give a detailed description of how to select the meta-parameters of the CQT for speech separation; the resulting SDR improvement is about 1 dB better than that of the original deep clustering. | https://fruct.org/publications/abstract23/files/Shi.pdf | Speech separation; deep learning; constant q transform; embedding; clustering |
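The separation step restated in the record above — clustering the TF-bin embeddings and turning the cluster assignments into per-speaker binary masks — can be sketched as follows. This is a minimal illustration with synthetic embeddings and a hand-rolled K-means, not the paper's trained network; all shapes, names, and the noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: T frames, F frequency bins, D-dimensional embeddings.
T, F, D, n_speakers = 50, 64, 20, 2

# Stand-in for the network output: one unit-norm embedding per TF bin,
# generated around two synthetic "speaker" centroids.
centroids = rng.normal(size=(n_speakers, D))
labels_true = rng.integers(0, n_speakers, size=T * F)
V = centroids[labels_true] + 0.1 * rng.normal(size=(T * F, D))
V /= np.linalg.norm(V, axis=1, keepdims=True)

def kmeans(X, k, iters=20):
    """Plain Lloyd's algorithm: assign to nearest mean, recompute means."""
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - means[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                means[j] = X[assign == j].mean(axis=0)
    return assign

assign = kmeans(V, n_speakers)

# Binary masks: one (T, F) mask per speaker, partitioning the TF plane.
masks = np.stack([(assign == j).reshape(T, F) for j in range(n_speakers)])
assert (masks.sum(axis=0) == 1).all()  # each TF bin goes to exactly one speaker
```

Each mask would then multiply the mixture's TF magnitude (STFT or CQT) before inversion to time domain, which is where the transform's invertibility and resolution — the paper's motivation for the CQT — matter.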
spellingShingle | Ziqiang Shi; Huibin Lin; Liu Liu; Rujie Liu; Shoji Hayakawa; Jiqing Han | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation | Proceedings of the XXth Conference of Open Innovations Association FRUCT | Speech separation; deep learning; constant q transform; embedding; clustering |
title | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation |
title_full | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation |
title_fullStr | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation |
title_full_unstemmed | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation |
title_short | Deep Clustering With Constant Q Transform For Multi-Talker Single Channel Speech Separation |
title_sort | deep clustering with constant q transform for multi talker single channel speech separation |
topic | Speech separation; deep learning; constant q transform; embedding; clustering |
url | https://fruct.org/publications/abstract23/files/Shi.pdf |
work_keys_str_mv | AT ziqiangshi deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation AT huibinlin deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation AT liuliu deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation AT rujieliu deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation AT shojihayakawa deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation AT jiqinghan deepclusteringwithconstantqtransformformultitalkersinglechannelspeechseparation |