Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT.

Speech Analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT) that implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a Mel-frequency scaled filterbank and a Discrete Cosine Transform g...


Bibliographic Details
Main Authors: Doroteo T Toledano, María Pilar Fernández-Gallego, Alicia Lozano-Diez
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2018-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC6179252?pdf=render
collection DOAJ
description Speech analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT), which implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a Mel-frequency scaled filterbank and a Discrete Cosine Transform, gives rise to the Mel-Frequency Cepstral Coefficients (MFCC), which have been the most common features in speech processing over the last decades. These features were particularly well suited to the previous state of the art in ASR, Hidden Markov Models/Gaussian Mixture Models (HMM/GMM). In particular, they produced highly uncorrelated features of small dimensionality (typically 13 coefficients plus deltas and double deltas), which was very convenient for diagonal-covariance GMMs, for dealing with the curse of dimensionality, and for the limited computing resources of a decade ago. Currently, most ASR systems use Deep Neural Networks (DNN) instead of GMMs for modeling the acoustic features, which provides more flexibility in the definition of the features. In particular, acoustic features can be highly correlated and much larger in size, because DNNs are very powerful at processing high-dimensionality inputs. Also, computing hardware has evolved to the point where computational cost in speech processing is a less relevant issue. In this context we have decided to revisit the problem of time-frequency resolution in speech analysis, and in particular to check whether multi-resolution speech analysis (both in time and frequency) can help improve acoustic modeling using DNNs. Our experiments start with several Kaldi baseline systems for the well-known TIMIT corpus and modify them by adding multi-resolution speech representations, built by concatenating spectra computed at different time-frequency resolutions as well as post-processed and speaker-adapted features computed at different time-frequency resolutions.
Our experiments show that using a multi-resolution speech representation tends to improve on the results obtained with the baseline single-resolution speech representation, which seems to confirm our main hypothesis. However, combining multi-resolution analysis with the highly post-processed and speaker-adapted features, which provide the best results in Kaldi for TIMIT, yields only very modest improvements.
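The idea of concatenating spectra computed at different time-frequency resolutions can be illustrated with a minimal sketch. It is not the authors' Kaldi pipeline: the function names, the Hann window, the naive DFT and the window lengths in the usage note are all assumptions chosen for illustration, and the Mel filterbank, post-processing and speaker adaptation mentioned in the description are omitted.

```python
import cmath
import math


def hann(n):
    """Hann window of length n (requires n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def log_power_spectrum(frame):
    """Log power spectrum of one frame, using a naive DFT (illustrative only)."""
    n = len(frame)
    x = [s * w for s, w in zip(frame, hann(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        # Small floor avoids log(0) on silent frames.
        spectrum.append(math.log(abs(acc) ** 2 + 1e-10))
    return spectrum


def multi_resolution_features(signal, win_lengths, hop):
    """Concatenate, per frame, the spectra obtained with several window lengths.

    Every window is centred on the same frame position, so each frame gets
    one vector holding all resolutions side by side. Short windows give
    finer time resolution, long windows finer frequency resolution.
    """
    max_n = max(win_lengths)
    half = max_n // 2
    last_centre = len(signal) - (max_n - half)
    features = []
    for centre in range(half, last_centre + 1, hop):
        vector = []
        for n in win_lengths:
            start = centre - n // 2
            vector.extend(log_power_spectrum(signal[start:start + n]))
        features.append(vector)
    return features
```

Called as `multi_resolution_features(signal, [8, 16], 8)` on a list of samples, this returns one vector per frame whose length is the sum of the bin counts for each window (here 5 + 9 = 14). A larger per-frame vector of correlated values is the kind of input the description says DNNs can handle and diagonal-covariance GMMs could not.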
first_indexed 2024-12-11T22:54:08Z
format Article
id doaj.art-e866b035176b4dada358e619a2d20bdb
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-12-11T22:54:08Z
publishDate 2018-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-e866b035176b4dada358e619a2d20bdb
2022-12-22T00:47:20Z
eng
Public Library of Science (PLoS)
PLoS ONE
1932-6203
2018-01-01
13
10
e0205355
10.1371/journal.pone.0205355
Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT.
Doroteo T Toledano
María Pilar Fernández-Gallego
Alicia Lozano-Diez
http://europepmc.org/articles/PMC6179252?pdf=render
title Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT.