Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings

Out-of-vocabulary (OOV) words are the most challenging problem in automatic speech recognition (ASR), especially for morphologically rich languages. Most end-to-end speech recognition systems operate at the word or character level. Amharic is a poorly resourced but morphologically...


Bibliographic Details
Main Authors: Eshete Derb Emiru, Shengwu Xiong, Yaxing Li, Awet Fesseha, Moussa Diallo
Format: Article
Language: English
Published: MDPI AG 2021-02-01
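Series: Information
Subjects: Amharic; automatic speech recognition; connectionist temporal classification with attention; natural language processing; low resource language; out-of-vocabulary
Online Access: https://www.mdpi.com/2078-2489/12/2/62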
_version_ 1827604337316069376
author Eshete Derb Emiru
Shengwu Xiong
Yaxing Li
Awet Fesseha
Moussa Diallo
collection DOAJ
description Out-of-vocabulary (OOV) words are the most challenging problem in automatic speech recognition (ASR), especially for morphologically rich languages. Most end-to-end speech recognition systems operate at the word or character level. Amharic is a poorly resourced but morphologically rich language. This paper proposes a hybrid connectionist temporal classification (CTC) with attention end-to-end architecture and a syllabification algorithm for an Amharic automatic speech recognition (AASR) system that uses phoneme-based subword units. The syllabification algorithm inserts the epenthetic vowel እ[ɨ], which is not covered by our grapheme-to-phoneme (G2P) conversion algorithm, developed using consonant–vowel (CV) representations of Amharic graphemes. The proposed end-to-end model was trained on various Amharic subword units, namely characters, phonemes, character-based subwords, and phoneme-based subwords generated by the byte-pair-encoding (BPE) segmentation algorithm. Experimental results showed that context-dependent phoneme-based subwords tend to yield more accurate speech recognition than their character, phoneme, and character-based subword counterparts. Further improvement was obtained by combining the proposed phoneme-based subwords with the syllabification algorithm and the SpecAugment data augmentation technique. The word error rate (WER) reduction was 18.38% compared to the baseline of character-based acoustic modeling with a word-based recurrent neural network language model (RNNLM). These phoneme-based subword models are also useful for improving machine translation and speech translation tasks.
first_indexed 2024-03-09T05:58:38Z
format Article
id doaj.art-c5dea3e50f95403bbe981c2f464cb5a8
institution Directory Open Access Journal
issn 2078-2489
language English
last_indexed 2024-03-09T05:58:38Z
publishDate 2021-02-01
publisher MDPI AG
record_format Article
series Information
spelling doaj.art-c5dea3e50f95403bbe981c2f464cb5a8
2023-12-03T12:11:57Z
eng
MDPI AG
Information
2078-2489
2021-02-01
12(2), 62
10.3390/info12020062
Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings
Eshete Derb Emiru (School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China)
Shengwu Xiong (School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China)
Yaxing Li (School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China)
Awet Fesseha (School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China)
Moussa Diallo (School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China)
https://www.mdpi.com/2078-2489/12/2/62
title Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings
topic Amharic
automatic speech recognition
connectionist temporal classification with attention
natural language processing
low resource language
out-of-vocabulary
url https://www.mdpi.com/2078-2489/12/2/62