LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition
Recently, Transformer-based models have shown promising results in automatic speech recognition (ASR), outperforming models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs). However, directly applying a Transformer to the ASR task does not exploit the correlation among speech frames effectively, leaving the model trapped in a sub-optimal solution. To this end, we propose a local attention Transformer model for speech recognition that combines the high correlation among speech frames. Specifically, we use relative positional embedding, rather than absolute positional embedding, to improve the generalization of the Transformer for speech sequences of different lengths. Secondly, we add local attention based on parametric positional relations to the self-attentive module and explicitly incorporate prior knowledge into the self-attentive module to make the training process insensitive to hyperparameters, thus improving the performance. Experiments carried out on the LibriSpeech dataset show that our proposed approach achieves a word error rate of 2.3/5.5% by language model fusion without any external data and reduces the word error rate by 17.8/9.8% compared to the baseline. The results are also close to, or better than, other state-of-the-art end-to-end models.
Main Authors: | Pengbin Fu, Daxing Liu, Huirong Yang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-05-01 |
Series: | Information |
Subjects: | end-to-end model; speech recognition; Transformer; local attention |
Online Access: | https://www.mdpi.com/2078-2489/13/5/250 |
_version_ | 1797499011601530880 |
---|---|
author | Pengbin Fu; Daxing Liu; Huirong Yang |
author_facet | Pengbin Fu; Daxing Liu; Huirong Yang |
author_sort | Pengbin Fu |
collection | DOAJ |
description | Recently, Transformer-based models have shown promising results in automatic speech recognition (ASR), outperforming models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs). However, directly applying a Transformer to the ASR task does not exploit the correlation among speech frames effectively, leaving the model trapped in a sub-optimal solution. To this end, we propose a local attention Transformer model for speech recognition that combines the high correlation among speech frames. Specifically, we use relative positional embedding, rather than absolute positional embedding, to improve the generalization of the Transformer for speech sequences of different lengths. Secondly, we add local attention based on parametric positional relations to the self-attentive module and explicitly incorporate prior knowledge into the self-attentive module to make the training process insensitive to hyperparameters, thus improving the performance. Experiments carried out on the LibriSpeech dataset show that our proposed approach achieves a word error rate of 2.3/5.5% by language model fusion without any external data and reduces the word error rate by 17.8/9.8% compared to the baseline. The results are also close to, or better than, other state-of-the-art end-to-end models. |
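The abstract describes two mechanisms layered onto standard scaled dot-product self-attention: a relative positional embedding (a bias indexed by the offset between frames) and a local attention term that favors nearby speech frames. The paper's exact formulation is not given in this record, so the following is only a minimal NumPy sketch of one common instantiation: a clipped relative-offset bias plus a hard locality window; the function name, the window/offset parameters, and the random bias values are all illustrative assumptions.

```python
import numpy as np

def local_attention(q, k, v, window=4, max_rel=8, seed=0):
    """Toy single-head self-attention over T frames of dimension d.

    Adds (a) a relative positional bias, one scalar per clipped offset
    j - i, and (b) a hard local window that masks frames farther than
    `window` steps away. Shapes: q, k, v are (T, d).
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (T, T) scaled dot-product

    # (a) Relative positional bias: stand-in random values for what would
    # be learned parameters in a real model (illustrative assumption).
    rng = np.random.default_rng(seed)
    rel_bias = rng.normal(scale=0.1, size=2 * max_rel + 1)
    offsets = np.clip(np.arange(T)[None, :] - np.arange(T)[:, None],
                      -max_rel, max_rel) + max_rel      # indices into rel_bias
    scores = scores + rel_bias[offsets]

    # (b) Local attention: -inf outside the window, so softmax gives 0 there.
    dist = np.abs(np.arange(T)[None, :] - np.arange(T)[:, None])
    scores = np.where(dist <= window, scores, -np.inf)

    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Usage: each frame attends only to frames within `window` steps.
rng = np.random.default_rng(1)
x = rng.normal(size=(10, 8))
out, w = local_attention(x, x, x, window=2)
```

A soft alternative, equally consistent with the phrase "parametric positional relations", would replace the hard mask with a learned Gaussian penalty on `dist`; the record does not say which variant LAS-Transformer uses.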
first_indexed | 2024-03-10T03:41:22Z |
format | Article |
id | doaj.art-467ea8b0c00242c28b20da12806c6331 |
institution | Directory Open Access Journal |
issn | 2078-2489 |
language | English |
last_indexed | 2024-03-10T03:41:22Z |
publishDate | 2022-05-01 |
publisher | MDPI AG |
record_format | Article |
series | Information |
spelling | doaj.art-467ea8b0c00242c28b20da12806c6331; 2023-11-23T11:30:19Z; eng; MDPI AG; Information; ISSN 2078-2489; 2022-05-01; Vol. 13, Iss. 5, Art. 250; DOI 10.3390/info13050250; LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition; Pengbin Fu, Daxing Liu, Huirong Yang (Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China); https://www.mdpi.com/2078-2489/13/5/250; end-to-end model; speech recognition; Transformer; local attention |
spellingShingle | Pengbin Fu Daxing Liu Huirong Yang LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition Information end-to-end model speech recognition Transformer local attention |
title | LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition |
title_full | LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition |
title_fullStr | LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition |
title_full_unstemmed | LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition |
title_short | LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition |
title_sort | las transformer an enhanced transformer based on the local attention mechanism for speech recognition |
topic | end-to-end model speech recognition Transformer local attention |
url | https://www.mdpi.com/2078-2489/13/5/250 |
work_keys_str_mv | AT pengbinfu lastransformeranenhancedtransformerbasedonthelocalattentionmechanismforspeechrecognition AT daxingliu lastransformeranenhancedtransformerbasedonthelocalattentionmechanismforspeechrecognition AT huirongyang lastransformeranenhancedtransformerbasedonthelocalattentionmechanismforspeechrecognition |