Corpus-based unit selection for natural-sounding speech synthesis

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.

Bibliographic Details
Main Author: Yi, Jon Rong-Wei, 1975-
Other Authors: James R. Glass.
Format: Thesis
Language: eng
Published: Massachusetts Institute of Technology 2005
Subjects: Electrical Engineering and Computer Science.
Online Access: http://hdl.handle.net/1721.1/16944
_version_ 1811081291969331200
author Yi, Jon Rong-Wei, 1975-
author2 James R. Glass.
author_facet James R. Glass.
Yi, Jon Rong-Wei, 1975-
author_sort Yi, Jon Rong-Wei, 1975-
collection MIT
description Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.
first_indexed 2024-09-23T11:44:26Z
format Thesis
id mit-1721.1/16944
institution Massachusetts Institute of Technology
language eng
last_indexed 2024-09-23T11:44:26Z
publishDate 2005
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/16944 2019-04-12T09:20:32Z Corpus-based unit selection for natural-sounding speech synthesis Yi, Jon Rong-Wei, 1975- James R. Glass. Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Electrical Engineering and Computer Science. Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (p. 179-196). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Speech synthesis is an automatic encoding process carried out by machine through which symbols conveying linguistic information are converted into an acoustic waveform. In the past decade or so, a recent trend toward a non-parametric, corpus-based approach has focused on using real human speech as source material for producing novel natural-sounding speech. This work proposes a communication-theoretic formulation in which unit selection is a noisy channel through which an input sequence of symbols passes and an output sequence, possibly corrupted due to the coverage limits of the corpus, emerges. The penalty of approximation is quantified by substitution and concatenation costs which grade what unit contexts are interchangeable and where concatenations are not perceivable. These costs are semi-automatically derived from data and are found to agree with acoustic-phonetic knowledge. The implementation is based on a finite-state transducer (FST) representation that has been successfully used in speech and language processing applications including speech recognition. A proposed constraint kernel topology connects all units in the corpus with associated substitution and concatenation costs and enables an efficient Viterbi search that operates with low latency and scales to large corpora.
An A* search can be applied in a second, rescoring pass to incorporate finer acoustic modelling. Extensions to this FST-based search include hierarchical and paralinguistic modelling. The search can also be used in an iterative feedback loop to record new utterances to enhance corpus coverage. This speech synthesis framework has been deployed across various domains and languages in many voices, a testament to its flexibility and rapid prototyping capability. Experimental subjects completing tasks in a given air travel planning scenario by interacting in real time with a spoken dialogue system over the telephone have found the system "easiest to understand" out of eight competing systems. In more detailed listening evaluations, subjective opinions garnered from human participants are found to be correlated with objective measures calculable by machine. by Jon Rong-Wei Yi. Ph.D. 2005-05-19T15:22:53Z 2005-05-19T15:22:53Z 2003 2003 Thesis http://hdl.handle.net/1721.1/16944 53246622 eng M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582 214 p. 3462128 bytes 3461885 bytes application/pdf application/pdf application/pdf Massachusetts Institute of Technology
spellingShingle Electrical Engineering and Computer Science.
Yi, Jon Rong-Wei, 1975-
Corpus-based unit selection for natural-sounding speech synthesis
title Corpus-based unit selection for natural-sounding speech synthesis
title_full Corpus-based unit selection for natural-sounding speech synthesis
title_fullStr Corpus-based unit selection for natural-sounding speech synthesis
title_full_unstemmed Corpus-based unit selection for natural-sounding speech synthesis
title_short Corpus-based unit selection for natural-sounding speech synthesis
title_sort corpus based unit selection for natural sounding speech synthesis
topic Electrical Engineering and Computer Science.
url http://hdl.handle.net/1721.1/16944
work_keys_str_mv AT yijonrongwei1975 corpusbasedunitselectionfornaturalsoundingspeechsynthesis
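
Illustrative note: the abstract describes a Viterbi search that picks corpus units by trading off substitution costs (how well a unit stands in for the requested symbol) against concatenation costs (how audible the join between two units is). The sketch below shows that idea in its simplest form. It is not the thesis's FST constraint kernel: it swaps in plain dynamic programming over a list of units, and the `Unit` class, the cost values (0.0, 0.5, 1.0), and the toy corpus are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Unit:
    """Hypothetical corpus unit: where it was recorded and what it realises."""
    utt: int    # index of the source utterance
    pos: int    # position of the unit within that utterance
    label: str  # symbol the unit realises


def substitution_cost(target: str, unit: Unit) -> float:
    # Assumed toy cost: free on an exact match, 1.0 otherwise.
    return 0.0 if unit.label == target else 1.0


def concatenation_cost(left: Unit, right: Unit) -> float:
    # Assumed toy cost: units that were adjacent in the recording join for free.
    contiguous = left.utt == right.utt and right.pos == left.pos + 1
    return 0.0 if contiguous else 0.5


def viterbi_select(targets: Sequence[str],
                   corpus: Sequence[Unit]) -> Tuple[float, List[Unit]]:
    """Return the cheapest unit sequence for `targets` and its total cost."""
    if not targets or not corpus:
        raise ValueError("targets and corpus must be non-empty")

    n = len(corpus)
    # cost[j]: best cost of any path that ends in corpus[j] at the current step.
    cost = [substitution_cost(targets[0], u) for u in corpus]
    backpointers: List[List[int]] = []

    for target in targets[1:]:
        new_cost: List[float] = []
        pointers: List[int] = []
        for unit in corpus:
            # Cheapest predecessor, including the cost of the join.
            best = min(range(n),
                       key=lambda i: cost[i] + concatenation_cost(corpus[i], unit))
            new_cost.append(cost[best]
                            + concatenation_cost(corpus[best], unit)
                            + substitution_cost(target, unit))
            pointers.append(best)
        cost = new_cost
        backpointers.append(pointers)

    # Trace back from the cheapest final unit.
    j = min(range(n), key=cost.__getitem__)
    total = cost[j]
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [corpus[k] for k in path]


# Toy corpus: utterance 0 is "a b c", utterance 1 is "b d".
corpus = [Unit(0, 0, "a"), Unit(0, 1, "b"), Unit(0, 2, "c"),
          Unit(1, 0, "b"), Unit(1, 1, "d")]

total, units = viterbi_select(["a", "b", "d"], corpus)
print(total, [(u.utt, u.pos, u.label) for u in units])
```

In this toy setting, a request that appears verbatim in one utterance is served by contiguous units at zero cost, while a request that must cross utterances pays for one join. This double loop is quadratic in corpus size per target symbol, so it only illustrates the cost trade-off and says nothing about how the thesis achieves low latency on large corpora.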