Self-Supervised Learning for Speech Processing
Deep neural networks trained with supervised learning algorithms on large amounts of labeled speech data have achieved remarkable performance on various spoken language processing applications, often setting the state of the art on the corresponding leaderboards. However, the reliance of these systems on large amounts of annotated speech poses a scalability bottleneck for the continued advancement of state-of-the-art performance, and an even more fundamental barrier to deploying deep neural networks in speech domains where labeled data are intrinsically rare, costly, or time-consuming to collect.

In contrast to annotated speech, untranscribed audio is often much cheaper to accumulate. In this thesis, we explore the use of self-supervised learning---a learning paradigm where the learning target is generated from the input itself---to leverage such easily scalable resources and improve the performance of spoken language technology. Specifically, we propose two self-supervised algorithms, one based on the idea of "future prediction" and the other based on the idea of "predicting the masked from the unmasked," for learning contextualized speech representations from unlabeled speech data. We show that our self-supervised algorithms learn representations that transform high-level properties of speech signals, such as their phonetic content and speaker characteristics, into a more accessible form than traditional acoustic features, and we demonstrate their effectiveness in improving the performance of deep neural networks on a wide range of speech processing tasks. In addition to presenting new learning algorithms, we provide extensive analysis aimed at understanding the properties of the learned self-supervised representations and at identifying the design factors that distinguish one self-supervised model from another.
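As a rough illustration of the two pretraining objectives named in the abstract, here is a minimal PyTorch-style sketch. The class names, network sizes, prediction shift, and masking probability below are all hypothetical choices made for this example; they are not the architectures or hyperparameters actually proposed in the thesis.

```python
# Hypothetical sketch of the two self-supervised objectives described in the
# abstract: "future prediction" and "predicting the masked from the unmasked".
import torch
import torch.nn as nn

class FuturePredictionModel(nn.Module):
    """Encode a feature sequence and predict the frame n steps ahead."""
    def __init__(self, feat_dim=80, hidden_dim=512, shift=3):
        super().__init__()
        self.shift = shift
        # Unidirectional RNN, so each output only sees past context.
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, feat_dim)

    def loss(self, x):  # x: (batch, time, feat_dim) acoustic features
        h, _ = self.encoder(x)
        pred = self.head(h[:, :-self.shift])  # prediction made at time t ...
        target = x[:, self.shift:]            # ... of the frame at t + shift
        return nn.functional.l1_loss(pred, target)

class MaskedPredictionModel(nn.Module):
    """Mask random frames and reconstruct them from the unmasked context."""
    def __init__(self, feat_dim=80, hidden_dim=512, mask_prob=0.15):
        super().__init__()
        self.mask_prob = mask_prob
        self.in_proj = nn.Linear(feat_dim, hidden_dim)
        # Bidirectional context: every position attends to the whole input.
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(hidden_dim, feat_dim)

    def loss(self, x):  # x: (batch, time, feat_dim)
        mask = torch.rand(x.shape[:2], device=x.device) < self.mask_prob
        corrupted = x.masked_fill(mask.unsqueeze(-1), 0.0)  # zero masked frames
        h = self.encoder(self.in_proj(corrupted))
        pred = self.head(h)
        # Score the reconstruction only at the positions that were masked.
        return nn.functional.l1_loss(pred[mask], x[mask])

# Usage: pretrain on unlabeled features, then reuse the encoder outputs as
# representations for a downstream task.
x = torch.randn(4, 200, 80)  # e.g., log-Mel filterbank features
print(FuturePredictionModel().loss(x).item())
print(MaskedPredictionModel().loss(x).item())
```

The key contrast the sketch tries to surface: the future-prediction objective forces a causal encoder to summarize the past, while the masked objective lets a bidirectional encoder exploit context on both sides of the gap.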
Main Author: | Chung, Yu-An |
---|---|
Other Authors: | Glass, James R. |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2022 |
Online Access: | https://hdl.handle.net/1721.1/144761 https://orcid.org/orcid=0000-0001-9451-7956 |
_version_ | 1826189259168546816 |
---|---|
author | Chung, Yu-An |
author2 | Glass, James R. |
author_facet | Glass, James R. Chung, Yu-An |
author_sort | Chung, Yu-An |
collection | MIT |
description | Deep neural networks trained with supervised learning algorithms on large amounts of labeled speech data have achieved remarkable performance on various spoken language processing applications, often setting the state of the art on the corresponding leaderboards. However, the reliance of these systems on large amounts of annotated speech poses a scalability bottleneck for the continued advancement of state-of-the-art performance, and an even more fundamental barrier to deploying deep neural networks in speech domains where labeled data are intrinsically rare, costly, or time-consuming to collect.
In contrast to annotated speech, untranscribed audio is often much cheaper to accumulate. In this thesis, we explore the use of self-supervised learning---a learning paradigm where the learning target is generated from the input itself---to leverage such easily scalable resources and improve the performance of spoken language technology. Specifically, we propose two self-supervised algorithms, one based on the idea of "future prediction" and the other based on the idea of "predicting the masked from the unmasked," for learning contextualized speech representations from unlabeled speech data. We show that our self-supervised algorithms learn representations that transform high-level properties of speech signals, such as their phonetic content and speaker characteristics, into a more accessible form than traditional acoustic features, and we demonstrate their effectiveness in improving the performance of deep neural networks on a wide range of speech processing tasks. In addition to presenting new learning algorithms, we provide extensive analysis aimed at understanding the properties of the learned self-supervised representations and at identifying the design factors that distinguish one self-supervised model from another. |
first_indexed | 2024-09-23T08:12:17Z |
format | Thesis |
id | mit-1721.1/144761 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T08:12:17Z |
publishDate | 2022 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/144761 2022-08-30T04:06:24Z Self-Supervised Learning for Speech Processing Chung, Yu-An Glass, James R. Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Ph.D. 2022-08-29T16:09:57Z 2022-08-29T16:09:57Z 2022-05 2022-06-21T19:15:42.936Z Thesis https://hdl.handle.net/1721.1/144761 https://orcid.org/orcid=0000-0001-9451-7956 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Chung, Yu-An Self-Supervised Learning for Speech Processing |
title | Self-Supervised Learning for Speech Processing |
title_full | Self-Supervised Learning for Speech Processing |
title_fullStr | Self-Supervised Learning for Speech Processing |
title_full_unstemmed | Self-Supervised Learning for Speech Processing |
title_short | Self-Supervised Learning for Speech Processing |
title_sort | self supervised learning for speech processing |
url | https://hdl.handle.net/1721.1/144761 https://orcid.org/orcid=0000-0001-9451-7956 |
work_keys_str_mv | AT chungyuan selfsupervisedlearningforspeechprocessing |