LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition


Bibliographic Details
Main Authors: Pengbin Fu, Daxing Liu, Huirong Yang
Author Affiliation: Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
Format: Article
Language: English
Published: MDPI AG, 2022-05-01
Series: Information
ISSN: 2078-2489
DOI: 10.3390/info13050250
Subjects: end-to-end model; speech recognition; Transformer; local attention
Online Access: https://www.mdpi.com/2078-2489/13/5/250

Description: Recently, Transformer-based models have shown promising results in automatic speech recognition (ASR), outperforming models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs). However, directly applying a Transformer to the ASR task does not exploit the correlation among speech frames effectively, leaving the model trapped in a sub-optimal solution. To this end, we propose a local attention Transformer model for speech recognition that exploits the high correlation among speech frames. Specifically, we use relative positional embedding, rather than absolute positional embedding, to improve the generalization of the Transformer to speech sequences of different lengths. Secondly, we add local attention based on parametric positional relations to the self-attention module and explicitly incorporate prior knowledge into the self-attention module to make the training process insensitive to hyperparameters, thus improving the performance. Experiments carried out on the LibriSpeech dataset show that our proposed approach achieves word error rates of 2.3/5.5% with language model fusion and without any external data, reducing the word error rate by 17.8/9.8% relative to the baseline. The results are also close to, or better than, other state-of-the-art end-to-end models.
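The description above says the model adds "local attention based on parametric positional relations" as a prior on the self-attention module. The record does not give the exact parameterisation, so the following is only a minimal sketch of the general idea: a distance-dependent bias (here a Gaussian, with a stand-in width parameter `sigma`) is subtracted from the scaled dot-product logits so that nearby speech frames attend to each other more strongly. The function name and `sigma` are illustrative assumptions, not the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_self_attention(Q, K, V, sigma=2.0):
    """Self-attention with a Gaussian locality bias on the logits.

    The bias -((i - j)**2) / (2 * sigma**2) depends only on the relative
    distance between query position i and key position j, so it encodes a
    locality prior over speech frames. `sigma` stands in for whatever
    learnable width parameter the paper uses; that detail is not in this
    record.
    """
    T, d = Q.shape
    logits = Q @ K.T / np.sqrt(d)                      # scaled dot-product scores
    pos = np.arange(T)
    dist = pos[:, None] - pos[None, :]                 # relative offsets i - j
    logits = logits - (dist ** 2) / (2.0 * sigma ** 2) # locality prior on logits
    weights = softmax(logits, axis=-1)                 # rows sum to 1
    return weights @ V

# Toy usage: 6 frames with 4-dimensional features, self-attention (Q = K = V = X)
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
out = local_self_attention(X, X, X, sigma=1.5)
print(out.shape)  # (6, 4)
```

Because the bias depends only on `i - j` rather than on absolute positions, it also illustrates why the relative formulation generalises across sequence lengths: the same distance penalty applies whether the utterance has 6 frames or 6,000.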