Multi-level acoustic modeling for automatic speech recognition

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012.

Bibliographic Details
Main Author: Chang, Hung-An, Ph. D. Massachusetts Institute of Technology
Other Authors: James R. Glass.
Format: Thesis
Language: English
Published: Massachusetts Institute of Technology, 2012
Subjects: Electrical Engineering and Computer Science
Online Access: http://hdl.handle.net/1721.1/74981
Description
Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 183-192). 192 p., application/pdf.

Abstract:
Context-dependent acoustic modeling is commonly used in large-vocabulary Automatic Speech Recognition (ASR) systems to model the coarticulatory variation that occurs during speech production. Typically, the local phoneme context is used to define context-dependent units. Because the number of possible context-dependent units can grow exponentially with the length of the context, many units do not have enough training examples to train a robust model, resulting in a data sparsity problem. For nearly two decades, this data sparsity problem has been handled by a clustering-based framework that systematically groups different context-dependent units into clusters so that each cluster has enough data. Although clustering deals with the data sparsity issue, it also forces all context-dependent units within a cluster to share the same acoustic score, producing a quantization effect that can limit the performance of the context-dependent model.

In this work, a multi-level acoustic modeling framework is proposed to address both the data sparsity problem and the quantization effect. Under the multi-level framework, each context-dependent unit is associated with classifiers that target multiple levels of contextual resolution, and the outputs of the classifiers are linearly combined for scoring during recognition. By choosing the classifiers judiciously, both the data sparsity problem and the quantization effect can be dealt with. The proposed multi-level framework can be integrated into existing large-vocabulary ASR systems, such as FST-based systems, and is compatible with state-of-the-art error reduction techniques, such as discriminative training methods.

Multiple sets of experiments compare the performance of the clustering-based acoustic model and the proposed multi-level model. In a phonetic recognition experiment on TIMIT, the multi-level model achieves about an 8% relative improvement in phone error rate, showing that the multi-level framework improves phonetic prediction accuracy. In a large-vocabulary transcription task, combining the multi-level framework with discriminative training provides more than a 20% relative improvement in Word Error Rate (WER) over a clustering-based baseline, showing that the framework integrates into existing large-vocabulary decoding frameworks and combines well with discriminative training methods. In a speaker-adaptive transcription task, the multi-level model achieves about a 14% relative WER improvement, showing that the proposed framework adapts better to new speakers, and potentially to new environments, than the conventional clustering-based approach.

by Hung-An Chang. Ph.D.

Rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See http://dspace.mit.edu/handle/1721.1/7582 for inquiries about permission.
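
The abstract's central mechanism, scoring each context-dependent unit as a linear combination of classifiers trained at several levels of contextual resolution, can be sketched in a few lines of Python. This is an illustrative reconstruction under assumptions, not the thesis's actual implementation: the toy Gaussian scorer, the monophone/biphone/triphone level structure, and the per-level weights are all hypothetical choices made to keep the example runnable.

import math

class GaussianScorer:
    """Toy one-dimensional Gaussian acoustic scorer (illustrative stand-in
    for whatever classifier is trained at each contextual level)."""
    def __init__(self, mean, var=1.0):
        self.mean, self.var = mean, var

    def log_score(self, x):
        # log N(x; mean, var)
        return -0.5 * (math.log(2 * math.pi * self.var)
                       + (x - self.mean) ** 2 / self.var)

def context_keys(left, phone, right):
    # Levels from coarse to fine: monophone, left biphone, right biphone,
    # triphone. The exact level inventory is an assumption of this sketch.
    return [phone,
            f"{left}-{phone}",
            f"{phone}+{right}",
            f"{left}-{phone}+{right}"]

def multilevel_log_score(models, weights, left, phone, right, x):
    """Linearly combine log scores from every level that has a trained model.

    Coarse levels see plenty of data, so every unit receives a score even if
    its triphone model was never trained (the data sparsity problem), while
    the fine levels let two units that share a coarse model still score
    differently (avoiding the quantization effect of hard clustering).
    """
    total = norm = 0.0
    for key, w in zip(context_keys(left, phone, right), weights):
        if key in models:
            total += w * models[key].log_score(x)
            norm += w
    return total / norm  # average over the levels actually present

# Usage: the triphone "k-ae+t" is unseen in training, so its score falls
# back on the monophone and biphone models rather than being tied to a
# cluster-wide score.
models = {
    "ae":   GaussianScorer(mean=0.0),   # monophone, always trainable
    "k-ae": GaussianScorer(mean=0.3),   # left biphone
    "ae+t": GaussianScorer(mean=-0.2),  # right biphone
}
weights = [1.0, 1.0, 1.0, 2.0]          # illustrative per-level weights
print(multilevel_log_score(models, weights, "k", "ae", "t", x=0.1))

Because every level contributes a distinct score, two triphones that would fall into the same cluster under a clustering-based model can still receive different combined scores here, which is the sense in which the multi-level combination avoids the quantization effect described above.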