A Bidirectional Context Embedding Transformer for Automatic Speech Recognition

Bibliographic Details
Main Authors: Lyuchao Liao, Francis Afedzie Kwofie, Zhifeng Chen, Guangjie Han, Yongqiang Wang, Yuyuan Lin, Dongmei Hu
Format: Article
Language: English
Published: MDPI AG 2022-01-01
Series: Information
Subjects:
Online Access: https://www.mdpi.com/2078-2489/13/2/69
_version_ 1797479237212438528
author Lyuchao Liao
Francis Afedzie Kwofie
Zhifeng Chen
Guangjie Han
Yongqiang Wang
Yuyuan Lin
Dongmei Hu
author_facet Lyuchao Liao
Francis Afedzie Kwofie
Zhifeng Chen
Guangjie Han
Yongqiang Wang
Yuyuan Lin
Dongmei Hu
author_sort Lyuchao Liao
collection DOAJ
description Transformers have become popular for building end-to-end automatic speech recognition (ASR) systems. However, transformer ASR systems are usually trained to produce output sequences in left-to-right order, disregarding the right-to-left context. Existing transformer-based ASR systems that employ two decoders for bidirectional decoding are costly to compute and optimize, while those that use a single decoder for bidirectional decoding require extra methods (such as a self-mask) to prevent information leakage in the attention mechanism. This paper explores different options for developing a speech transformer that uses a single decoder equipped with bidirectional context embedding (BCE) for bidirectional decoding. The decoding direction, set up at the input level, enables the model to attend to different directional contexts without extra decoders and also alleviates information leakage. The effectiveness of this method was verified with a bidirectional beam search that generates bidirectional output sequences and selects the best hypothesis according to the output score. We achieved a word error rate (WER) of 7.65%/18.97% on the clean/other LibriSpeech test sets, outperforming the left-to-right decoding style in our work by 3.17%/3.47%. The results are also close to, or better than, those of other state-of-the-art end-to-end models.
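The decoding scheme the abstract describes, a direction tag supplied at the input level steering a single decoder, with a bidirectional beam search keeping whichever direction scores higher, can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: `toy_decoder`, the direction tokens, and the toy probabilities are all hypothetical stand-ins for the trained speech transformer.

```python
import math

# Illustrative sketch (not the paper's model): one decoder is steered by a
# direction token given at the input level, and a bidirectional beam search
# keeps whichever directional hypothesis scores higher.

L2R, R2L = "<l2r>", "<r2l>"  # direction tokens prepended to the decoder input

def toy_decoder(direction, prefix):
    """Hypothetical next-token distribution; a real model would condition
    on the encoder output and the direction embedding."""
    if direction == L2R:
        return {"a": 0.5, "b": 0.3, "<eos>": 0.2}
    return {"b": 0.5, "a": 0.3, "<eos>": 0.2}  # R2L sees the reversed context

def beam_search(direction, beam=2, max_len=4):
    """Standard beam search; the direction token is the only extra input."""
    hyps = [([], 0.0)]  # (token sequence, log-probability)
    for _ in range(max_len):
        cand = []
        for seq, score in hyps:
            if seq and seq[-1] == "<eos>":
                cand.append((seq, score))  # finished hypotheses carry over
                continue
            for tok, p in toy_decoder(direction, seq).items():
                cand.append((seq + [tok], score + math.log(p)))
        hyps = sorted(cand, key=lambda h: h[1], reverse=True)[:beam]
    return hyps[0]

# Decode in both directions with the same single decoder and keep the
# higher-scoring output; the right-to-left hypothesis is reversed back
# into reading order before use.
l2r_seq, l2r_score = beam_search(L2R)
r2l_seq, r2l_score = beam_search(R2L)
best = l2r_seq if l2r_score >= r2l_score else list(reversed(r2l_seq))
```

Because both directions share one decoder, this avoids the duplicated parameters and optimization cost of two-decoder designs; setting the direction at the input level is what lets one set of weights serve both contexts.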
first_indexed 2024-03-09T21:42:56Z
format Article
id doaj.art-43a55fa782c240eda16c130c4a454474
institution Directory Open Access Journal
issn 2078-2489
language English
last_indexed 2024-03-09T21:42:56Z
publishDate 2022-01-01
publisher MDPI AG
record_format Article
series Information
spelling doaj.art-43a55fa782c240eda16c130c4a4544742023-11-23T20:25:20ZengMDPI AGInformation2078-24892022-01-011326910.3390/info13020069A Bidirectional Context Embedding Transformer for Automatic Speech RecognitionLyuchao Liao0Francis Afedzie Kwofie1Zhifeng Chen2Guangjie Han3Yongqiang Wang4Yuyuan Lin5Dongmei Hu6Fujian Key Laboratory of Automotive Electronics and Electric Drive, Fujian University of Technology, Fuzhou 350118, ChinaFujian Key Laboratory of Automotive Electronics and Electric Drive, Fujian University of Technology, Fuzhou 350118, ChinaFujian Key Laboratory of Automotive Electronics and Electric Drive, Fujian University of Technology, Fuzhou 350118, ChinaFujian Provincial Universities Engineering Research Center for Intelligent Driving Technology, Fujian University of Technology, Fuzhou 350118, ChinaFujian Key Laboratory of Automotive Electronics and Electric Drive, Fujian University of Technology, Fuzhou 350118, ChinaFujian Key Laboratory of Automotive Electronics and Electric Drive, Fujian University of Technology, Fuzhou 350118, ChinaFujian Key Laboratory of Automotive Electronics and Electric Drive, Fujian University of Technology, Fuzhou 350118, ChinaTransformers have become popular in building end-to-end automatic speech recognition (ASR) systems. However, transformer ASR systems are usually trained to give output sequences in the left-to-right order, disregarding the right-to-left context. Currently, the existing transformer-based ASR systems that employ two decoders for bidirectional decoding are complex in terms of computation and optimization. The existing ASR transformer with a single decoder for bidirectional decoding requires extra methods (such as a self-mask) to resolve the problem of information leakage in the attention mechanism. This paper explores different options for the development of a speech transformer that utilizes a single decoder equipped with bidirectional context embedding (BCE) for bidirectional decoding. 
The decoding direction, which is set up at the input level, enables the model to attend to different directional contexts without extra decoders and also alleviates any information leakage. The effectiveness of this method was verified with a bidirectional beam search method that generates bidirectional output sequences and determines the best hypothesis according to the output score. We achieved a word error rate (WER) of 7.65%/18.97% on the clean/other LibriSpeech test set, outperforming the left-to-right decoding style in our work by 3.17%/3.47%. The results are also close to, or better than, other state-of-the-art end-to-end models.https://www.mdpi.com/2078-2489/13/2/69automatic speech recognition (ASR)speech transformerbidirectional decoderbidirectional embeddingend-to-end modelattention
spellingShingle Lyuchao Liao
Francis Afedzie Kwofie
Zhifeng Chen
Guangjie Han
Yongqiang Wang
Yuyuan Lin
Dongmei Hu
A Bidirectional Context Embedding Transformer for Automatic Speech Recognition
Information
automatic speech recognition (ASR)
speech transformer
bidirectional decoder
bidirectional embedding
end-to-end model
attention
title A Bidirectional Context Embedding Transformer for Automatic Speech Recognition
title_full A Bidirectional Context Embedding Transformer for Automatic Speech Recognition
title_fullStr A Bidirectional Context Embedding Transformer for Automatic Speech Recognition
title_full_unstemmed A Bidirectional Context Embedding Transformer for Automatic Speech Recognition
title_short A Bidirectional Context Embedding Transformer for Automatic Speech Recognition
title_sort bidirectional context embedding transformer for automatic speech recognition
topic automatic speech recognition (ASR)
speech transformer
bidirectional decoder
bidirectional embedding
end-to-end model
attention
url https://www.mdpi.com/2078-2489/13/2/69
work_keys_str_mv AT lyuchaoliao abidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT francisafedziekwofie abidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT zhifengchen abidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT guangjiehan abidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT yongqiangwang abidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT yuyuanlin abidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT dongmeihu abidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT lyuchaoliao bidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT francisafedziekwofie bidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT zhifengchen bidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT guangjiehan bidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT yongqiangwang bidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT yuyuanlin bidirectionalcontextembeddingtransformerforautomaticspeechrecognition
AT dongmeihu bidirectionalcontextembeddingtransformerforautomaticspeechrecognition