Transfer Learning For Spoken Language Processing
This thesis develops transfer learning paradigms for spoken language processing applications. In particular, we tackle domain adaptation in the context of Automatic Speech Recognition (ASR) and Cross-Lingual Learning in Automatic Speech Translation (AST). The first part of the thesis develops an...
Main Author: | Khurana, Sameer |
---|---|
Other Authors: | Glass, James R. |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2023 |
Online Access: | https://hdl.handle.net/1721.1/151674 |
_version_ | 1826215543888150528 |
---|---|
author | Khurana, Sameer |
author2 | Glass, James R. |
author_facet | Glass, James R. Khurana, Sameer |
author_sort | Khurana, Sameer |
collection | MIT |
description | This thesis develops transfer learning paradigms for spoken language processing applications. In particular, we tackle domain adaptation in the context of Automatic Speech Recognition (ASR) and Cross-Lingual Learning in Automatic Speech Translation (AST).
The first part of the thesis develops an algorithm for unsupervised domain adaptation of End-to-End ASR models. In recent years, ASR performance has improved dramatically owing to the availability of large annotated corpora and novel neural network architectures. However, ASR performance drops considerably when the training data distribution does not match the distribution that the model encounters during deployment (the target domain). A straightforward remedy is to collect labeled data in the target domain and re-train the source-domain ASR model. However, labeled examples are often expensive to collect, while unlabeled data is more accessible. Hence, there is a need for unsupervised domain adaptation methods. To that end, we develop a simple but effective adaptation algorithm called Dropout Uncertainty-Driven Self-Training (DUST). DUST repurposes the classic Self-Training (ST) algorithm to make it suitable for the domain adaptation problem.
The second part of the thesis develops a transformer neural network encoder that embeds speech from several languages into a shared, semantically aligned joint speech-text embedding space. To learn the multimodal semantic embedding space, we propose a teacher/student learning framework where we fine-tune a pre-trained multilingual speech encoder (student) using semantic supervision from a pre-trained multilingual semantic text encoder (teacher). We show that by building multilingual speech-to-text translation technology using the semantic representations learned by our speech encoder, we can achieve significant zero-shot cross-lingual task transfer from high-resource spoken languages seen during training to low-resource spoken languages unseen during training. |
first_indexed | 2024-09-23T16:34:23Z |
format | Thesis |
id | mit-1721.1/151674 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T16:34:23Z |
publishDate | 2023 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/1516742023-08-01T03:04:09Z Transfer Learning For Spoken Language Processing Khurana, Sameer Glass, James R. Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Ph.D. 2023-07-31T19:58:04Z 2023-07-31T19:58:04Z 2023-06 2023-07-13T14:22:23.611Z Thesis https://hdl.handle.net/1721.1/151674 Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by-sa/4.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Khurana, Sameer Transfer Learning For Spoken Language Processing |
title | Transfer Learning For Spoken Language Processing |
title_full | Transfer Learning For Spoken Language Processing |
title_fullStr | Transfer Learning For Spoken Language Processing |
title_full_unstemmed | Transfer Learning For Spoken Language Processing |
title_short | Transfer Learning For Spoken Language Processing |
title_sort | transfer learning for spoken language processing |
url | https://hdl.handle.net/1721.1/151674 |
work_keys_str_mv | AT khuranasameer transferlearningforspokenlanguageprocessing |
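
The description above names Dropout Uncertainty-Driven Self-Training (DUST) but this record does not spell out its selection rule. The following is a minimal sketch of one way dropout uncertainty could gate pseudo-labels in self-training, under stated assumptions: an utterance is kept only if hypotheses decoded with dropout enabled stay close, in normalized edit distance, to the hypothesis decoded with dropout disabled. The names `select_pseudo_labels`, `toy_decode`, `num_samples`, and `threshold` are hypothetical and are not taken from the thesis.

```python
import random
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def select_pseudo_labels(
    utterances: List[str],
    decode: Callable[[str, bool, int], List[str]],
    num_samples: int = 3,
    threshold: float = 0.3,
) -> List[Tuple[str, List[str]]]:
    """Keep utterances whose dropout-sampled hypotheses agree with the reference.

    Assumption: `decode(utt, dropout, seed)` returns a token list, and the
    filter compares the worst-case normalized edit distance to `threshold`.
    """
    selected = []
    for utt in utterances:
        reference = decode(utt, False, 0)  # dropout off: deterministic hypothesis
        if not reference:
            continue  # nothing to normalize against
        worst = max(
            edit_distance(reference, decode(utt, True, seed)) / len(reference)
            for seed in range(1, num_samples + 1)
        )
        if worst <= threshold:  # model is stable under dropout: trust the label
            selected.append((utt, reference))
    return selected


def toy_decode(utt: str, dropout: bool, seed: int) -> List[str]:
    """Hypothetical stand-in for an ASR model: unstable on 'noisy' inputs."""
    words = utt.split()
    if dropout and utt.startswith("noisy"):
        rng = random.Random(seed)
        return [rng.choice(["uh", "um", "hm"]) for _ in words]
    return words


kept = select_pseudo_labels(["clean read speech", "noisy channel audio"], toy_decode)
print(kept)
```

In a self-training loop, the selected pairs would be added to the training set and the model re-trained, with the filter re-applied to the remaining unlabeled target-domain data in each round. The threshold trades pseudo-label quantity against quality and would need tuning on held-out data.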