Corpus-based unit selection for natural-sounding speech synthesis

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.

Bibliographic Details
Main Author:	Yi, Jon Rong-Wei, 1975-
Other Authors:	James R. Glass.
Format:	Thesis
Language:	eng
Published:	Massachusetts Institute of Technology 2005
Subjects:	Electrical Engineering and Computer Science.
Online Access:	http://hdl.handle.net/1721.1/16944

_version_	1811081291969331200
author	Yi, Jon Rong-Wei, 1975-
author2	James R. Glass.
author_facet	James R. Glass. Yi, Jon Rong-Wei, 1975-
author_sort	Yi, Jon Rong-Wei, 1975-
collection	MIT
description	Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.
first_indexed	2024-09-23T11:44:26Z
format	Thesis
id	mit-1721.1/16944
institution	Massachusetts Institute of Technology
language	eng
last_indexed	2024-09-23T11:44:26Z
publishDate	2005
publisher	Massachusetts Institute of Technology
record_format	dspace
spelling	mit-1721.1/16944 2019-04-12T09:20:32Z Corpus-based unit selection for natural-sounding speech synthesis Yi, Jon Rong-Wei, 1975- James R. Glass. Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Electrical Engineering and Computer Science. Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (p. 179-196). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Speech synthesis is an automatic encoding process carried out by machine through which symbols conveying linguistic information are converted into an acoustic waveform. In the past decade or so, a recent trend toward a non-parametric, corpus-based approach has focused on using real human speech as source material for producing novel natural-sounding speech. This work proposes a communication-theoretic formulation in which unit selection is a noisy channel through which an input sequence of symbols passes and an output sequence, possibly corrupted due to the coverage limits of the corpus, emerges. The penalty of approximation is quantified by substitution and concatenation costs which grade what unit contexts are interchangeable and where concatenations are not perceivable. These costs are semi-automatically derived from data and are found to agree with acoustic-phonetic knowledge. The implementation is based on a finite-state transducer (FST) representation that has been successfully used in speech and language processing applications including speech recognition. A proposed constraint kernel topology connects all units in the corpus with associated substitution and concatenation costs and enables an efficient Viterbi search that operates with low latency and scales to large corpora. An A* search can be applied in a second, rescoring pass to incorporate finer acoustic modelling. Extensions to this FST-based search include hierarchical and paralinguistic modelling. The search can also be used in an iterative feedback loop to record new utterances to enhance corpus coverage. This speech synthesis framework has been deployed across various domains and languages in many voices, a testament to its flexibility and rapid prototyping capability.
Experimental subjects completing tasks in a given air travel planning scenario by interacting in real time with a spoken dialogue system over the telephone have found the system "easiest to understand" out of eight competing systems. In more detailed listening evaluations, subjective opinions garnered from human participants are found to be correlated with objective measures calculable by machine.
by Jon Rong-Wei Yi. Ph.D. 2005-05-19T15:22:53Z 2005-05-19T15:22:53Z 2003 2003 Thesis http://hdl.handle.net/1721.1/16944 53246622 eng M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582 214 p. 3462128 bytes 3461885 bytes application/pdf application/pdf application/pdf Massachusetts Institute of Technology
spellingShingle	Electrical Engineering and Computer Science. Yi, Jon Rong-Wei, 1975- Corpus-based unit selection for natural-sounding speech synthesis
title	Corpus-based unit selection for natural-sounding speech synthesis
title_full	Corpus-based unit selection for natural-sounding speech synthesis
title_fullStr	Corpus-based unit selection for natural-sounding speech synthesis
title_full_unstemmed	Corpus-based unit selection for natural-sounding speech synthesis
title_short	Corpus-based unit selection for natural-sounding speech synthesis
title_sort	corpus based unit selection for natural sounding speech synthesis
topic	Electrical Engineering and Computer Science.
url	http://hdl.handle.net/1721.1/16944
work_keys_str_mv	AT yijonrongwei1975 corpusbasedunitselectionfornaturalsoundingspeechsynthesis

Similar Items