Finding Sparse Subnetworks in Self-Supervised Speech Recognition and Speech Synthesis
The modern paradigm in speech processing has demonstrated the importance of scale and compute for end-to-end speech recognition and synthesis. For instance, state-of-the-art self-supervised speech representation learning models typically consist of more than 300M model parameters and are trained...
Main Author: | Lai, Cheng-I Jeff |
---|---|
Other Authors: | Glass, James R. |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2022 |
Online Access: | https://hdl.handle.net/1721.1/144615 |
---|---|
author | Lai, Cheng-I Jeff |
author2 | Glass, James R. |
collection | MIT |
description | The modern paradigm in speech processing has demonstrated the importance of scale and compute for end-to-end speech recognition and synthesis. For instance, state-of-the-art self-supervised speech representation learning models typically consist of more than 300M model parameters and are trained on 24 GPUs. While such a paradigm has proven effective in certain offline settings, the extent to which it can be extended to online and small-device scenarios remains unclear.
This thesis is a step toward making advanced speech processing models more parameter-efficient. We aim to answer the following: do sparse subnetworks exist in modern speech processing models, and if so, how can we discover them efficiently? The key contribution is a new pruning algorithm, termed Prune-Adjust-Re-Prune (PARP), that discovers sparse subnetworks efficiently. PARP is inspired by our observation that subnetworks pruned for pre-training tasks need only a slight adjustment to achieve a sizeable performance boost on downstream ASR tasks. We first demonstrate its effectiveness for self-supervised ASR in various low-resource settings. In particular, extensive experiments verify (1) that sparse subnetworks exist in monolingual/multilingual pre-trained self-supervised learning representations, and (2) the computational advantage and performance gain of PARP over baseline pruning methods.
In the second study, we extend PARP to end-to-end TTS, including both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. The findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, that pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, and with similar prosody. |
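The abstract describes PARP only at a high level: prune, adjust, then re-prune. As a rough illustration of one such cycle on a single weight matrix (not the thesis's actual implementation; the function names, the magnitude-based pruning criterion, and the dense "adjust" update in which pruned weights may revive are all assumptions inferred from the description above):

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Boolean mask keeping the (1 - sparsity) fraction of largest-magnitude weights."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)  # number of weights to prune
    if k == 0:
        return np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    return np.abs(weights) > threshold

def parp_step(weights, grads, sparsity, lr=0.01):
    """One hypothetical Prune-Adjust-Re-Prune cycle (sketch).

    Prune:    zero out small-magnitude weights.
    Adjust:   take a dense gradient step, so previously pruned
              weights can receive updates and re-enter the subnetwork.
    Re-prune: recompute the magnitude mask at the target sparsity.
    """
    mask = magnitude_mask(weights, sparsity)         # prune
    adjusted = weights * mask - lr * grads           # adjust (dense update)
    new_mask = magnitude_mask(adjusted, sparsity)    # re-prune
    return adjusted * new_mask, new_mask
```

In this sketch, the "adjust" step is what distinguishes the cycle from one-shot magnitude pruning: because the gradient update is dense, a weight zeroed in an earlier iteration can grow back and displace a less useful weight at the next re-pruning.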
format | Thesis |
id | mit-1721.1/144615 |
institution | Massachusetts Institute of Technology |
publishDate | 2022 |
publisher | Massachusetts Institute of Technology |
spelling | mit-1721.1/144615 2022-08-30T03:29:20Z Finding Sparse Subnetworks in Self-Supervised Speech Recognition and Speech Synthesis Lai, Cheng-I Jeff Glass, James R. Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science S.M. 2022-08-29T15:59:46Z 2022-08-29T15:59:46Z 2022-05 2022-06-21T19:25:48.481Z Thesis https://hdl.handle.net/1721.1/144615 0000-0002-2343-8596 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
title | Finding Sparse Subnetworks in Self-Supervised Speech Recognition and Speech Synthesis |
url | https://hdl.handle.net/1721.1/144615 |