Joint spectral and temporal normalization of features for robust recognition of noisy and reverberated speech

In this paper, we propose a framework for joint normalization of the spectral and temporal statistics of speech features for robust speech recognition. Current feature normalization approaches normalize the spectral and temporal aspects of feature statistics separately to overcome noise and reverberation. As a result, the interaction between spectral normalization (e.g., mean and variance normalization, MVN) and temporal normalization (e.g., temporal structure normalization, TSN) is ignored. We propose a joint spectral and temporal normalization (JSTN) framework to normalize these two aspects of feature statistics simultaneously. In JSTN, feature trajectories are filtered by linear filters whose coefficients are optimized by maximizing a likelihood-based objective function. Experimental results on the Aurora-5 benchmark task show that JSTN consistently outperforms the cascade of MVN and TSN on test data corrupted by both additive noise and reverberation, which validates our proposal. Specifically, JSTN reduces the average word error rate by 8-9% relative over the MVN-TSN cascade on both artificial and real noisy data.
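
The abstract describes two generic operations on feature trajectories: per-utterance normalization of spectral statistics (as in MVN) and linear filtering of each dimension's temporal trajectory (as in TSN and JSTN). Below is a minimal Python sketch of these two steps, assuming MFCC-like features stored as a NumPy array of shape (frames, dimensions); the filter taps used here are an illustrative placeholder, not the likelihood-optimized JSTN coefficients described in the paper.

import numpy as np

# A minimal sketch, not the authors' implementation: utterance-level mean and
# variance normalization (MVN) followed by linear filtering of each feature
# dimension's trajectory, the generic operation that TSN and the proposed JSTN
# build on. In JSTN the filter coefficients would be learned by maximizing a
# likelihood-based objective; here `taps` is a hypothetical placeholder.

def mvn(features):
    """Normalize each feature dimension to zero mean and unit variance over frames."""
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8  # avoid division by zero for flat dimensions
    return (features - mean) / std

def filter_trajectories(features, taps):
    """Filter every feature dimension's trajectory with the same FIR taps."""
    return np.stack(
        [np.convolve(features[:, d], taps, mode="same") for d in range(features.shape[1])],
        axis=1,
    )

# Example: 200 frames of 13-dimensional MFCC-like features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 13))
taps = np.array([0.25, 0.5, 0.25])   # placeholder smoothing filter, not learned JSTN taps
normalized = filter_trajectories(mvn(feats), taps)
print(normalized.shape)              # (200, 13)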

Bibliographic Details
Main Authors: Xiao, Xiong; Chng, Eng Siong; Li, Haizhou
Other Authors: School of Computer Engineering; Temasek Laboratories
Format: Conference Paper
Language: English
Published: 2012
Conference: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), Kyoto, Japan
Citation: Xiao, X., Chng, E. S., & Li, H. (2012). Joint spectral and temporal normalization of features for robust recognition of noisy and reverberated speech. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4325-4328.
DOI: 10.1109/ICASSP.2012.6288876
Subjects: DRNTU::Engineering::Computer science and engineering
Rights: © 2012 IEEE.
Institution: Nanyang Technological University
Online Access: https://hdl.handle.net/10356/98409
http://hdl.handle.net/10220/13398