Speech processing with less supervision : learning from weak labels and multiple modalities
Thesis: Ph.D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, May 2020
Main Author: | Hsu, Wei-Ning
---|---
Other Authors: | James R. Glass
Format: | Thesis
Language: | English
Published: | Massachusetts Institute of Technology, 2020
Subjects: | Electrical Engineering and Computer Science
Online Access: | https://hdl.handle.net/1721.1/127021
Description:

In recent years, supervised learning has achieved great success in speech processing with powerful neural network models and vast quantities of in-domain labeled data. However, collecting a labeled dataset covering all domains can be either expensive, due to the diversity of speech, or nearly impossible for some tasks, such as speech-to-speech translation. Such a paradigm limits the applicability of speech technologies to high-resource settings. In sharp contrast, humans are good at extracting training signals from indirect supervision, such as small amounts of explicit labels and input from different modalities. This capability enables humans to learn from a wider variety of resources, with better domain coverage. In light of this observation, this thesis focuses on learning algorithms for speech processing that can utilize weak and indirect supervision to overcome the restrictions imposed by the supervised paradigm and make the most of the data at hand.

In the first part of the thesis, we devise a self-training algorithm for speech recognition that distills knowledge from a trained language model, a compact form of external, non-speech prior knowledge. The algorithm is inspired by how humans use contextual and prior information to bias speech recognition and produce confident predictions. To distill the knowledge within the language model, we implement a beam-search-based objective that aligns the model's prediction probabilities with the language model's likelihoods over candidate hypotheses. Experimental results demonstrate state-of-the-art performance, recovering up to 90% of the word error rate improvement obtained by training on the same data with ground-truth transcripts. Moreover, we show that the proposed algorithm scales to 60,000 hours of unlabeled speech and yields further reductions in word error rate.
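To make the hypothesis-selection step concrete, here is a minimal, self-contained Python sketch of choosing a pseudo-label by combining acoustic and language model scores over beam candidates. The function name `select_pseudo_label`, the simple log-linear combination, and the `lm_weight` parameter are illustrative assumptions, not the thesis's exact beam-search objective.

```python
from typing import Callable, List, Tuple

def select_pseudo_label(
    hypotheses: List[Tuple[str, float]],   # (transcript, acoustic log-prob) per beam entry
    lm_logprob: Callable[[str], float],    # scores a transcript under a trained LM
    lm_weight: float = 0.5,                # interpolation weight for the LM score
) -> str:
    """Pick the beam candidate whose combined acoustic + LM score is highest.

    This mirrors the idea of biasing recognition toward hypotheses the
    language model finds likely, then keeping the winner as a pseudo-label
    for self-training.
    """
    def combined(hyp: Tuple[str, float]) -> float:
        transcript, acoustic_logp = hyp
        return acoustic_logp + lm_weight * lm_logprob(transcript)

    return max(hypotheses, key=combined)[0]

# Toy usage with a stand-in lookup-table LM.
toy_lm = {"the cat sat": -2.0, "the cat sad": -6.0}
beam = [("the cat sad", -1.0), ("the cat sat", -1.3)]
print(select_pseudo_label(beam, lambda t: toy_lm.get(t, -10.0)))
# -> "the cat sat": the LM score overrides the slightly better acoustic score
```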
In the second part of the thesis, we present several text-to-speech synthesis models that enable fine-grained control of unlabeled, non-textual attributes, including voice, prosody, acoustic environment properties, and microphone channel effects. We achieve controllability of unlabeled attributes by formulating a text-to-speech system as a generative model with structured latent variables, and we learn this generative process, along with an efficient approximate inference model, by adopting the variational autoencoder framework. We demonstrate that these latent variables can then be used to control the unlabeled variation in speech, making it possible to build a high-quality speech synthesis model from weakly labeled, mixed-quality speech data, since the model learns to control the hidden factors.
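For reference, the standard conditional variational autoencoder bound that such a formulation optimizes is sketched below; this is the generic form, and the thesis's exact factorization of the structured latent variables may differ.

```latex
% Schematic conditional-VAE evidence lower bound: text t, speech x, and a
% latent variable z capturing unlabeled attributes (voice, prosody, channel);
% q_phi is the approximate inference (encoder) network.
\log p_\theta(x \mid t) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid t, z) \right]
  - \mathrm{KL}\!\left( q_\phi(z \mid x) \,\middle\|\, p(z) \right)
```

Maximizing this bound trains the decoder and the inference network jointly; at synthesis time, fixing or sampling individual components of z is what provides control over the corresponding attributes.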
In the last part of the thesis, we extend the cross-modal semantic embedding learning framework proposed in Harwath et al. (2019) to learn hierarchical discrete linguistic units from visually grounded speech, a form of multimodal sensory data. By utilizing a discriminative, multimodal grounding objective, the proposed framework forces the learned units to be useful for semantic image retrieval. In contrast, most previous work on linguistic unit discovery does not use multimodal data: it adopts a reconstruction objective that encourages the learned units to be useful for reconstructing the speech, and hence those units may also encode non-linguistic factors. Experimental results show that the proposed framework outperforms state-of-the-art phonetic unit discovery frameworks by almost 50% on the ZeroSpeech 2019 ABX phone discrimination task, and learns word detectors that discover over 270 words with an F1 score greater than 0.5. In addition, the units learned by the proposed framework are more robust to nuisance variation than those learned from speech alone.
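As an illustration of the kind of discriminative grounding objective involved, here is a hypothetical PyTorch sketch of a margin-based cross-modal ranking loss over a batch of matched image/spoken-caption pairs; the exact loss, negative sampling, and unit-discovery layers in the thesis differ from this simplification.

```python
import torch
import torch.nn.functional as F

def cross_modal_margin_loss(
    image_emb: torch.Tensor,   # (B, D) image embeddings, assumed L2-normalized
    speech_emb: torch.Tensor,  # (B, D) spoken-caption embeddings, assumed L2-normalized
    margin: float = 1.0,
) -> torch.Tensor:
    """Hinge loss that ranks each matched pair above all in-batch imposters.

    Matched pairs sit on the diagonal of the similarity matrix; every
    off-diagonal entry is an imposter. Pushing true-pair similarities above
    imposter similarities by at least `margin` is what forces the speech
    representation to carry semantics usable for image retrieval.
    """
    sims = image_emb @ speech_emb.t()            # (B, B) similarity matrix
    pos = sims.diag().unsqueeze(1)               # (B, 1) matched-pair scores
    cost_i2s = F.relu(margin + sims - pos)       # image -> speech direction
    cost_s2i = F.relu(margin + sims - pos.t())   # speech -> image direction
    # Zero out the diagonal so true pairs are not counted as their own imposters.
    eye = torch.eye(sims.size(0), device=sims.device, dtype=torch.bool)
    return (cost_i2s.masked_fill(eye, 0.0) + cost_s2i.masked_fill(eye, 0.0)).mean()
```

Unlike a reconstruction objective, this loss never asks the model to regenerate the waveform, so representations that succeed at it have little incentive to retain speaker- or channel-specific detail.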
Notes: Cataloged from the official PDF of the thesis. Includes bibliographical references (pages 191-217). 217 pages, application/pdf.

Rights: MIT theses may be protected by copyright. Please reuse MIT thesis content according to the MIT Libraries Permissions Policy: http://dspace.mit.edu/handle/1721.1/7582