A Light-Weight Autoregressive CNN-Based Frame Level Transducer Decoder for End-to-End ASR

A convolutional neural network (CNN) transducer decoder was proposed to reduce the decoding time of an end-to-end automatic speech recognition (ASR) system while maintaining accuracy. The CNN of 177 k parameters and a kernel size of 6 generates the probabilities of the current token at the token lev...

Bibliographic Details
Main Authors:	Hyeon-Kyu Noh, Hong-June Park
Format:	Article
Language:	English
Published:	MDPI AG 2024-02-01
Series:	Applied Sciences
Subjects:	speech recognition, autoregressive speech recognition, end-to-end, CNN, transducer decoder
Online Access:	https://www.mdpi.com/2076-3417/14/3/1300

author	Hyeon-Kyu Noh, Hong-June Park
author_facet	Hyeon-Kyu Noh, Hong-June Park
author_sort	Hyeon-Kyu Noh
collection	DOAJ
description	A convolutional neural network (CNN) transducer decoder was proposed to reduce the decoding time of an end-to-end automatic speech recognition (ASR) system while maintaining accuracy. The CNN, which has 177 k parameters and a kernel size of 6, generates the probabilities of the current token at the token level, at each token transition of the output token sequence. Two probabilities of the current token, one from the encoder and the other from the CNN, are added at the frame level, which reduces the number of decoding steps to the number of input frames. An encoder composed of an 18-layer Conformer was combined with the proposed decoder and trained on the LibriSpeech data set using the forward-backward algorithm. Space and re-appearance tokens are added to the 300 word-piece tokens to represent the token string; a space token appears at a frame between two words. Compared with autoregressive decoders such as Transformer and RNN-T decoders, this work provides comparable word error rates (WERs) with much less decoding time. Compared with non-autoregressive decoders such as CTC, this work improves WERs.
first_indexed	2024-03-08T04:00:59Z
format	Article
id	doaj.art-d5b7dddd4da540a58d3a8f095d1a8641
institution	Directory Open Access Journal
issn	2076-3417
language	English
last_indexed	2024-03-08T04:00:59Z
publishDate	2024-02-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj.art-d5b7dddd4da540a58d3a8f095d1a8641; 2024-02-09T15:08:33Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2024-02-01; vol. 14, no. 3, art. 1300; 10.3390/app14031300; A Light-Weight Autoregressive CNN-Based Frame Level Transducer Decoder for End-to-End ASR; Hyeon-Kyu Noh, Hong-June Park (both: Department of Electronic and Electrical Engineering, Pohang University of Science and Technology (POSTECH), Pohang 37673, Republic of Korea); https://www.mdpi.com/2076-3417/14/3/1300; speech recognition, autoregressive speech recognition, end-to-end, CNN, transducer decoder
title	A Light-Weight Autoregressive CNN-Based Frame Level Transducer Decoder for End-to-End ASR
topic	speech recognition, autoregressive speech recognition, end-to-end, CNN, transducer decoder
url	https://www.mdpi.com/2076-3417/14/3/1300
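
The frame-level decoding described in this record's description field can be loosely illustrated with a greedy loop that takes one step per input frame, adds an encoder score and a decoder score for every token, and refreshes the decoder output only when the chosen token changes. This is a minimal sketch, not the authors' implementation: `greedy_frame_decode`, `encoder_scores`, and `decoder_scores` are hypothetical names, the decoder is an arbitrary callable standing in for the CNN, decoder scores are simply cached on frames that stay on the same token (an assumption, since the abstract does not say how such frames are handled), and the space and re-appearance tokens are not modeled.

```python
from typing import Callable, List, Optional, Sequence

KERNEL_SIZE = 6  # token-history length, matching the kernel size in the abstract


def greedy_frame_decode(
    encoder_scores: Sequence[Sequence[float]],
    decoder_scores: Callable[[List[int]], Sequence[float]],
    kernel_size: int = KERNEL_SIZE,
) -> List[int]:
    """Greedy frame-synchronous decoding: exactly one step per input frame.

    encoder_scores[t][v] is a hypothetical per-frame score for token v.
    decoder_scores(history) is a hypothetical stand-in for the CNN decoder:
    it maps the most recent emitted tokens to one score per token.
    """
    tokens: List[int] = []            # collapsed output token sequence
    dec = decoder_scores([])          # decoder output for the empty history
    prev: Optional[int] = None        # label chosen at the previous frame
    for frame in encoder_scores:
        # Frame-level combination: add the encoder and decoder scores.
        combined = [e + d for e, d in zip(frame, dec)]
        best = max(range(len(combined)), key=combined.__getitem__)
        if best != prev:
            # Token transition: emit the token and refresh the decoder
            # output from the last `kernel_size` tokens only.
            tokens.append(best)
            dec = decoder_scores(tokens[-kernel_size:])
            prev = best
    return tokens
```

Because the decoder is called only at token transitions and sees a fixed-length history, the number of loop iterations equals the number of frames regardless of the output length, which is the property the abstract credits for the reduced decoding time.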

Similar Items