LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition
Recently, Transformer-based models have shown promising results in automatic speech recognition (ASR), outperforming models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs). However, directly applying a Transformer to the ASR task does not exploit the correlation among speech frames effectively, leaving the model trapped in a sub-optimal solution. To this end, we propose a local attention Transformer model for speech recognition that combines the high correlation among speech frames. Specifically, we use relative positional embedding, rather than absolute positional embedding, to improve the generalization of the Transformer for speech sequences of different lengths. Secondly, we add local attention based on parametric positional relations to the self-attentive module and explicitly incorporate prior knowledge into the self-attentive module to make the training process insensitive to hyperparameters, thus improving the performance. Experiments carried out on the LibriSpeech dataset show that our proposed approach achieves a word error rate of 2.3/5.5% by language model fusion without any external data and reduces the word error rate by 17.8/9.8% compared to the baseline. The results are also close to, or better than, other state-of-the-art end-to-end models.
Main Authors: | Pengbin Fu, Daxing Liu, Huirong Yang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-05-01 |
Series: | Information |
Subjects: | end-to-end model; speech recognition; Transformer; local attention |
Online Access: | https://www.mdpi.com/2078-2489/13/5/250 |
_version_ | 1797499011601530880 |
---|---|
author | Pengbin Fu; Daxing Liu; Huirong Yang |
author_facet | Pengbin Fu; Daxing Liu; Huirong Yang |
author_sort | Pengbin Fu |
collection | DOAJ |
description | Recently, Transformer-based models have shown promising results in automatic speech recognition (ASR), outperforming models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs). However, directly applying a Transformer to the ASR task does not exploit the correlation among speech frames effectively, leaving the model trapped in a sub-optimal solution. To this end, we propose a local attention Transformer model for speech recognition that combines the high correlation among speech frames. Specifically, we use relative positional embedding, rather than absolute positional embedding, to improve the generalization of the Transformer for speech sequences of different lengths. Secondly, we add local attention based on parametric positional relations to the self-attentive module and explicitly incorporate prior knowledge into the self-attentive module to make the training process insensitive to hyperparameters, thus improving the performance. Experiments carried out on the LibriSpeech dataset show that our proposed approach achieves a word error rate of 2.3/5.5% by language model fusion without any external data and reduces the word error rate by 17.8/9.8% compared to the baseline. The results are also close to, or better than, other state-of-the-art end-to-end models. |
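The abstract describes two mechanisms layered onto standard scaled dot-product self-attention: a relative positional embedding (a bias indexed by the offset between frames) and a local attention term that favors nearby speech frames. The paper's exact formulation is not given in this record, so the following is only a minimal NumPy sketch of one common instantiation: a clipped relative-offset bias plus a hard locality window; the function name, the window/offset parameters, and the random bias values are all illustrative assumptions.

```python
import numpy as np

def local_attention(q, k, v, window=4, max_rel=8, seed=0):
    """Toy single-head self-attention over T frames of dimension d.

    Adds (a) a relative positional bias, one scalar per clipped offset
    j - i, and (b) a hard local window that masks frames farther than
    `window` steps away. Shapes: q, k, v are (T, d).
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T) scaled dot-product

    # (a) Relative positional bias: stand-in random values for what would
    # be learned parameters in a real model (illustrative assumption).
    rng = np.random.default_rng(seed)
    rel_bias = rng.normal(scale=0.1, size=2 * max_rel + 1)
    offsets = np.clip(np.arange(T)[None, :] - np.arange(T)[:, None],
                      -max_rel, max_rel) + max_rel      # indices into rel_bias
    scores = scores + rel_bias[offsets]

    # (b) Local attention: -inf outside the window, so softmax gives 0 there.
    dist = np.abs(np.arange(T)[None, :] - np.arange(T)[:, None])
    scores = np.where(dist <= window, scores, -np.inf)

    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Usage: each frame attends only to frames within `window` steps.
rng = np.random.default_rng(1)
x = rng.normal(size=(10, 8))
out, w = local_attention(x, x, x, window=2)
```

A soft alternative, equally consistent with the phrase "parametric positional relations", would replace the hard mask with a learned Gaussian penalty on `dist`; the record does not say which variant LAS-Transformer uses.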
first_indexed | 2024-03-10T03:41:22Z |
format | Article |
id | doaj.art-467ea8b0c00242c28b20da12806c6331 |
institution | Directory Open Access Journal |
issn | 2078-2489 |
language | English |
last_indexed | 2024-03-10T03:41:22Z |
publishDate | 2022-05-01 |
publisher | MDPI AG |
record_format | Article |
series | Information |
spelling | doaj.art-467ea8b0c00242c28b20da12806c6331; 2023-11-23T11:30:19Z; eng; MDPI AG; Information; ISSN 2078-2489; 2022-05-01; Vol. 13, Iss. 5, Art. 250; DOI 10.3390/info13050250; LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition; Pengbin Fu, Daxing Liu, Huirong Yang (Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China); https://www.mdpi.com/2078-2489/13/5/250; end-to-end model; speech recognition; Transformer; local attention |
spellingShingle | Pengbin Fu Daxing Liu Huirong Yang LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition Information end-to-end model speech recognition Transformer local attention |
title | LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition |
title_full | LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition |
title_fullStr | LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition |
title_full_unstemmed | LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition |
title_short | LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition |
title_sort | las transformer an enhanced transformer based on the local attention mechanism for speech recognition |
topic | end-to-end model speech recognition Transformer local attention |
url | https://www.mdpi.com/2078-2489/13/5/250 |
work_keys_str_mv | AT pengbinfu lastransformeranenhancedtransformerbasedonthelocalattentionmechanismforspeechrecognition AT daxingliu lastransformeranenhancedtransformerbasedonthelocalattentionmechanismforspeechrecognition AT huirongyang lastransformeranenhancedtransformerbasedonthelocalattentionmechanismforspeechrecognition |