Improving Transformer Based End-to-End Code-Switching Speech Recognition Using Language Identification


Bibliographic Details
Main Authors: Zheying Huang, Pei Wang, Jian Wang, Haoran Miao, Ji Xu, Pengyuan Zhang
Format: Article
Language: English
Published: MDPI AG 2021-09-01
Series: Applied Sciences
Subjects:
Online Access: https://www.mdpi.com/2076-3417/11/19/9106
Description
Summary: A Recurrent Neural Network (RNN)-based attention model has been used in code-switching speech recognition (CSSR). However, due to the sequential computation constraint of RNNs, short-range dependencies are stronger and long-range dependencies are weaker, which makes it hard to switch languages promptly in CSSR. Firstly, to deal with this problem, we introduce the CTC-Transformer, which relies entirely on a self-attention mechanism to draw global dependencies and adopts connectionist temporal classification (CTC) as an auxiliary task for better convergence. Secondly, we propose two multi-task learning recipes, in which a language identification (LID) auxiliary task is learned in addition to the CTC-Transformer automatic speech recognition (ASR) task. Thirdly, we study a decoding strategy that combines LID into the ASR task. Experiments on the SEAME corpus demonstrate the effectiveness of the proposed methods, achieving a mixed error rate (MER) of 30.95%. This is up to a 19.35% relative MER reduction compared to the baseline RNN-based CTC-Attention system, and an 8.86% relative MER reduction compared to the baseline CTC-Transformer system.
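To illustrate the multi-task setup described in the abstract, below is a minimal sketch (not the authors' code) of a joint objective that combines the CTC auxiliary loss, the attention-decoder cross-entropy, and an LID auxiliary loss. The function name and the weights lambda_ctc and gamma_lid are hypothetical hyperparameters introduced only for this example.

```python
# Minimal sketch of a CTC + attention + LID multi-task loss (assumed PyTorch-style tensors).
import torch
import torch.nn.functional as F

def multitask_loss(ctc_log_probs, ctc_targets, input_lengths, target_lengths,
                   attn_logits, attn_targets, lid_logits, lid_targets,
                   lambda_ctc=0.3, gamma_lid=0.1):
    """Combine CTC, attention-decoder, and LID losses into one training objective."""
    # CTC auxiliary loss on encoder outputs; ctc_log_probs has shape (T, B, V).
    loss_ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lengths, target_lengths)
    # Attention-decoder loss: per-token cross-entropy on the transcript;
    # attn_logits has shape (B, T, V), attn_targets shape (B, T), padding marked with -1.
    loss_attn = F.cross_entropy(attn_logits.transpose(1, 2), attn_targets, ignore_index=-1)
    # LID auxiliary loss: per-token language labels (e.g. Mandarin vs. English).
    loss_lid = F.cross_entropy(lid_logits.transpose(1, 2), lid_targets, ignore_index=-1)
    # Interpolate the ASR losses and add the weighted LID term.
    return lambda_ctc * loss_ctc + (1.0 - lambda_ctc) * loss_attn + gamma_lid * loss_lid
```

In this sketch, lambda_ctc trades off the CTC and attention objectives and gamma_lid scales the LID auxiliary task; the actual recipes and decoding strategy are detailed in the paper itself.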
ISSN:2076-3417