JSUM: A Multitask Learning Speech Recognition Model for Jointly Supervised and Unsupervised Learning

In recent years, end-to-end speech recognition models have emerged as a popular alternative to the traditional Deep Neural Network-Hidden Markov Model (DNN-HMM) approach. An end-to-end model maps acoustic features directly onto text sequences via a single network architecture, significantly streamlining model construction. However, training end-to-end speech recognition models typically requires a large quantity of supervised data to achieve good performance, which poses a challenge in low-resource conditions. Unsupervised representation learning significantly reduces this requirement. Recent research has focused on end-to-end techniques employing joint Connectionist Temporal Classification (CTC) and attention mechanisms, with some work also concentrating on unsupervised representation learning. This paper proposes a joint supervised and unsupervised multi-task learning model (JSUM). The approach leverages the unsupervised pre-trained wav2vec 2.0 model as a shared encoder and integrates the joint CTC-Attention network and a generative adversarial network into a unified end-to-end architecture. The method provides a new low-resource speech recognition solution that makes full use of both supervised and unsupervised datasets by combining CTC, attention, and generative adversarial losses, and it is suitable for both monolingual and cross-lingual scenarios.
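The abstract describes combining CTC, attention, and generative adversarial losses into a single training objective. A minimal sketch of one such weighted combination follows; the function name, parameters, and weighting scheme are illustrative assumptions, not the paper's actual formulation:

```python
def jsum_loss(ctc_loss: float, attention_loss: float,
              adversarial_loss: float, lam: float = 0.3,
              beta: float = 1.0) -> float:
    """Weighted multi-task objective (illustrative sketch).

    lam interpolates between the CTC and attention losses, as in joint
    CTC-Attention training; beta scales the adversarial loss from the
    unsupervised branch. All weight values here are assumptions, not
    values reported in the paper.
    """
    supervised = lam * ctc_loss + (1.0 - lam) * attention_loss
    return supervised + beta * adversarial_loss

# Example: equal CTC/attention interpolation plus the adversarial term.
print(jsum_loss(2.0, 1.0, 0.5, lam=0.5))  # 2.0
```

In practice the three terms would be tensors produced by the CTC head, the attention decoder, and the discriminator, but the scalar combination has the same shape.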


Bibliographic Details
Main Authors: Nurmemet Yolwas, Weijing Meng
Format: Article
Language: English
Published: MDPI AG, 2023-04-01
Series: Applied Sciences
Subjects: end-to-end speech recognition; multitask learning; supervised learning; unsupervised learning
Online Access: https://www.mdpi.com/2076-3417/13/9/5239
ISSN: 2076-3417
DOI: 10.3390/app13095239
Author Affiliation: Xinjiang Multilingual Information Technology Laboratory, Urumqi 830017, China