Toward an interpretive framework of two-dimensional speech-signal processing

Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2011.

Bibliographic Details
Main Author: Wang, Tianyu Tom
Other Authors: Thomas F. Quatieri.
Format: Thesis
Language: English
Published: Massachusetts Institute of Technology 2011
Subjects: Harvard University--MIT Division of Health Sciences and Technology.
Online Access: http://hdl.handle.net/1721.1/65520
Department: Harvard University--MIT Division of Health Sciences and Technology.
Degree: Ph.D.
Statement of Responsibility: by Tianyu Tom Wang.
Notes: Cataloged from PDF version of thesis. Includes bibliographical references (p. 177-179).
Physical Description: 179 p.
File Format: application/pdf
Record Timestamp: 2011-08-30T15:45:31Z
Identifier: 746796041
Rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582

Summary:

Traditional representations of speech are derived from short-time segments of the signal and result in time-frequency distributions of energy such as the short-time Fourier transform and spectrogram. Speech-signal models of such representations have had utility in a variety of applications such as speech analysis, recognition, and synthesis. Nonetheless, they do not capture spectral, temporal, and joint spectrotemporal energy fluctuations (or "modulations") present in local time-frequency regions of the time-frequency distribution. Inspired by principles from image processing and evidence from auditory neurophysiological models, a variety of two-dimensional (2-D) processing techniques have been explored in the literature as alternative representations of speech; however, speech-based models are lacking in this framework.

This thesis develops speech-signal models for a particular 2-D processing approach in which 2-D Fourier transforms are computed on local time-frequency regions of the canonical narrowband or wideband spectrogram; we refer to the resulting transformed space as the Grating Compression Transform (GCT). We argue for a 2-D sinusoidal-series amplitude modulation model of speech content in the spectrogram domain that relates to speech production characteristics such as pitch/noise of the source, pitch dynamics, formant structure and dynamics, and offset/onset content. Narrowband- and wideband-based models are shown to exhibit important distinctions in interpretation and oftentimes "dual" behavior. In the transformed GCT space, the modeling results in a novel taxonomy of signal behavior based on the distribution of formant and onset/offset content in the transformed space via source characteristics. Our formulation provides a speech-specific interpretation of the concept of "modulation" in 2-D processing, in contrast to existing approaches that have done so either phenomenologically through qualitative analyses and/or implicitly through data-driven machine learning approaches. One implication of the proposed taxonomy is its potential for interpreting transformations of other time-frequency distributions such as the auditory spectrogram, which is generally viewed as being "narrowband"/"wideband" in its low/high-frequency regions.

The proposed signal model is evaluated in several ways. First, we perform analysis of synthetic speech signals to characterize its properties and limitations. Next, we develop an algorithm for analysis/synthesis of spectrograms using the model and demonstrate its ability to accurately represent real speech content. As an example application, we further apply the models in cochannel speaker separation, exploiting the GCT's ability to distribute speaker-specific content and often recover overlapping information through demodulation and interpolation in the 2-D GCT space. Specifically, in multi-pitch estimation, we demonstrate the GCT's ability to accurately estimate separate and crossing pitch tracks under certain conditions. Finally, we demonstrate the model's ability to separate mixtures of speech signals using both prior and estimated pitch information. Generalization to other speech-signal processing applications is proposed.
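
The summary describes the Grating Compression Transform (GCT) as a 2-D Fourier transform computed on a local time-frequency region of a spectrogram. The following is a minimal illustrative sketch of that idea, not the thesis's implementation. All function names (`spectrogram`, `local_gct`, `demo`) and all parameter values (8 kHz sampling rate, 32 ms Hamming analysis window, 1 ms hop, 64-bin by 50-frame patch, mean removal, 2-D Hamming taper, 256-point zero-padding) are assumptions chosen for illustration. The demo uses a synthetic harmonic signal with a steady 200 Hz pitch, so the harmonic lines in a narrowband spectrogram patch should appear as a concentrated peak in the transformed space whose distance from the origin is inversely related to the harmonic spacing.

```python
import numpy as np

def spectrogram(x, win_len, hop, n_fft):
    """Magnitude STFT; rows are frequency bins, columns are time frames."""
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_len] * window for i in range(n_frames)],
        axis=1,
    )
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=0))

def local_gct(spec, f_start, t_start, n_freq, n_time, n_fft2=256):
    """2-D Fourier transform magnitude of one local spectrogram patch.

    Mean removal and the 2-D Hamming taper are illustrative choices
    (assumptions), used here to suppress the DC term and edge effects.
    """
    patch = spec[f_start : f_start + n_freq, t_start : t_start + n_time]
    patch = patch - patch.mean()
    taper = np.outer(np.hamming(n_freq), np.hamming(n_time))
    transformed = np.fft.fft2(patch * taper, s=(n_fft2, n_fft2))
    return np.abs(np.fft.fftshift(transformed))

def demo():
    fs, pitch, n_fft, n_fft2 = 8000, 200.0, 512, 256
    t = np.arange(int(0.5 * fs)) / fs
    # Synthetic "voiced" source: 15 equal-amplitude harmonics of a steady pitch.
    x = sum(np.sin(2 * np.pi * k * pitch * t) for k in range(1, 16))

    # Narrowband spectrogram: 32 ms window (256 samples), 1 ms hop (8 samples).
    spec = spectrogram(x, win_len=256, hop=8, n_fft=n_fft)

    # Local region: 64 frequency bins starting at 500 Hz, 50 frames.
    gct = local_gct(spec, f_start=32, t_start=100, n_freq=64, n_time=50,
                    n_fft2=n_fft2)

    # Find the dominant peak in the half-plane above the origin.
    center = n_fft2 // 2
    upper = gct[center + 1 :, :]
    row, col = np.unravel_index(np.argmax(upper), upper.shape)
    row_offset, col_offset = row + 1, col - center

    # Vertical peak distance -> harmonic spacing in bins -> pitch in Hz.
    spacing_bins = n_fft2 / row_offset
    pitch_hz = spacing_bins * fs / n_fft
    return row_offset, col_offset, pitch_hz

if __name__ == "__main__":
    row_offset, col_offset, pitch_hz = demo()
    print(f"peak offset (freq, time): ({row_offset}, {col_offset})")
    print(f"harmonic spacing implies a pitch of about {pitch_hz:.1f} Hz")
```

Because the pitch in this synthetic signal is constant, the harmonic lines are horizontal and the peak lies on the vertical axis of the transformed space. A changing pitch would tilt the harmonic lines within the patch and move the peak off that axis, which is one way to read the "pitch dynamics" mentioned in the summary.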