Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

This paper investigates deep neural network (DNN) based nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained on a parallel clean/distorted speech corpus to map reverb...

Bibliographic Details
Main Authors:	Xiao, Xiong; Zhao, Shengkui; Nguyen, Duc Hoang Ha; Zhong, Xionghu; Jones, Douglas L.; Chng, Eng Siong; Li, Haizhou
Other Authors:	School of Computer Engineering
Format:	Journal Article
Language:	English
Published:	2016
Subjects:	Speech enhancement; Deep neural networks; Dynamic features; Feature adaptation; Robust speech recognition; Reverberation challenge; Beamforming
Online Access:	https://hdl.handle.net/10356/82372 http://hdl.handle.net/10220/39943

collection	NTU
description	This paper investigates deep neural network (DNN) based nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained on a parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as the log magnitude spectrum) to the underlying clean speech coefficients. The constraints imposed by dynamic features (i.e., the time derivatives of the speech coefficients) are used to enhance the smoothness of the predicted coefficient trajectories in two ways. One is to obtain the enhanced speech coefficients by least-squares estimation from the coefficients and dynamic features predicted by the DNN. The other is to incorporate the dynamic feature constraint directly into the DNN training process using a sequential cost function. In the linear feature adaptation approach, a sparse linear transform, called the cross transform, is used to transform multiple frames of speech coefficients to a new feature space. The transform is estimated to maximize the likelihood of the transformed coefficients given a model of clean speech coefficients. Unlike the DNN approach, no parallel corpus is used and no assumption about distortion types is made. The two approaches are evaluated on the REVERB Challenge 2014 tasks. Both speech enhancement and automatic speech recognition (ASR) results show that the DNN-based mappings significantly reduce the reverberation in speech and improve both speech quality and ASR performance. For the speech enhancement task, the proposed dynamic feature constraint helps to improve the cepstral distance, frequency-weighted segmental signal-to-noise ratio (SNR), and log likelihood ratio metrics, while moderately degrading the speech-to-reverberation modulation energy ratio. In addition, the cross transform feature adaptation significantly improves ASR performance for acoustic models trained on clean-condition data.
first_indexed	2024-10-01T06:23:21Z
id	ntu-10356/82372
institution	Nanyang Technological University
last_indexed	2024-10-01T06:23:21Z
record_format	dspace
spelling	ntu-10356/82372 2020-09-26T22:18:35Z
author2	School of Computer Engineering; Temasek Laboratories
version	Published version
dates	2016-02-03T08:22:27Z; 2019-12-06T14:54:21Z
citation	Xiao, X., Zhao, S., Nguyen, D. H. H., Zhong, X., Jones, D. L., Chng, E. S., et al. (2016). Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation. EURASIP Journal on Advances in Signal Processing, 2016, 4-.
issn	1687-6172
doi	10.1186/s13634-015-0300-4
journal	EURASIP Journal on Advances in Signal Processing
rights	© 2016 Xiao et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
extent	18 p.
mimetype	application/pdf
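
The abstract describes obtaining enhanced speech coefficients by least-squares estimation from the static coefficients and dynamic features predicted by the DNN. The following is only a rough sketch of that general idea, not the authors' implementation. It assumes a simple central-difference delta window, equal weighting of the static and dynamic terms, and hypothetical function names (delta_matrix, smooth_trajectory); the paper's actual window, weights, and any use of acceleration features may differ.

```python
import numpy as np


def delta_matrix(num_frames: int) -> np.ndarray:
    """Build an assumed delta operator D with d[t] = 0.5 * (x[t+1] - x[t-1]).

    Edge frames are replicated, so the first and last rows use a
    one-sided difference.
    """
    D = np.zeros((num_frames, num_frames))
    for t in range(num_frames):
        prev_t = max(t - 1, 0)
        next_t = min(t + 1, num_frames - 1)
        D[t, next_t] += 0.5
        D[t, prev_t] -= 0.5
    return D


def smooth_trajectory(static_pred, delta_pred, delta_weight=1.0):
    """Least-squares trajectory from predicted static and delta features.

    Minimizes ||x - static_pred||^2 + delta_weight * ||D x - delta_pred||^2,
    whose closed-form solution satisfies
    (I + w D^T D) x = static_pred + w D^T delta_pred.
    Inputs have shape (frames,) or (frames, dims).
    """
    static_pred = np.asarray(static_pred, dtype=float)
    delta_pred = np.asarray(delta_pred, dtype=float)
    num_frames = static_pred.shape[0]
    D = delta_matrix(num_frames)
    A = np.eye(num_frames) + delta_weight * (D.T @ D)
    b = static_pred + delta_weight * (D.T @ delta_pred)
    return np.linalg.solve(A, b)


if __name__ == "__main__":
    # Toy example: noisy static predictions, exact delta predictions.
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 50)
    clean = np.sin(2 * np.pi * t)
    noisy_static = clean + 0.1 * rng.standard_normal(t.size)
    exact_delta = delta_matrix(t.size) @ clean
    enhanced = smooth_trajectory(noisy_static, exact_delta)
    print("error before:", np.linalg.norm(noisy_static - clean))
    print("error after: ", np.linalg.norm(enhanced - clean))
```

In this toy setting the delta predictions pull the noisy static trajectory toward a smoother one. With real DNN outputs both streams are noisy, so delta_weight (an assumed tuning knob here) would need to be chosen empirically.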

Similar Items