Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings
Out-of-vocabulary (OOV) words are the most challenging problem in automatic speech recognition (ASR), especially for morphologically rich languages. Most end-to-end speech recognition systems operate at the word or character level of a language. Amharic is a poorly resourced but morphologically...
Main Authors: | Eshete Derb Emiru, Shengwu Xiong, Yaxing Li, Awet Fesseha, Moussa Diallo |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-02-01 |
Series: | Information |
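Subjects: | Amharic; automatic speech recognition; connectionist temporal classification with attention; natural language processing; low resource language; out-of-vocabulary |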
Online Access: | https://www.mdpi.com/2078-2489/12/2/62 |
_version_ | 1827604337316069376 |
---|---|
author | Eshete Derb Emiru Shengwu Xiong Yaxing Li Awet Fesseha Moussa Diallo |
author_sort | Eshete Derb Emiru |
collection | DOAJ |
description | Out-of-vocabulary (OOV) words are the most challenging problem in automatic speech recognition (ASR), especially for morphologically rich languages. Most end-to-end speech recognition systems operate at the word or character level of a language. Amharic is a poorly resourced but morphologically rich language. This paper proposes a hybrid connectionist temporal classification (CTC) with attention end-to-end architecture and a syllabification algorithm for an Amharic automatic speech recognition system (AASR) using its phoneme-based subword units. The syllabification algorithm inserts the epenthetic vowel እ[ɨ], which is not covered by the authors' Grapheme-to-Phoneme (G2P) conversion algorithm, developed using consonant–vowel (CV) representations of Amharic graphemes. The proposed end-to-end model was trained on various Amharic subword units, namely characters, phonemes, character-based subwords, and phoneme-based subwords generated by the byte-pair-encoding (BPE) segmentation algorithm. Experimental results showed that context-dependent phoneme-based subwords tend to result in more accurate speech recognition systems than the character-based, phoneme-based, and character-based subword counterparts. Further improvement was obtained for the proposed phoneme-based subwords with the syllabification algorithm and the SpecAugment data augmentation technique. The word error rate (WER) reduction was 18.38% compared to character-based acoustic modeling with the word-based recurrent neural network language modeling (RNNLM) baseline. These phoneme-based subword models are also useful for improving machine translation and speech translation tasks. |
first_indexed | 2024-03-09T05:58:38Z |
format | Article |
id | doaj.art-c5dea3e50f95403bbe981c2f464cb5a8 |
institution | Directory of Open Access Journals |
issn | 2078-2489 |
language | English |
last_indexed | 2024-03-09T05:58:38Z |
publishDate | 2021-02-01 |
publisher | MDPI AG |
record_format | Article |
series | Information |
spelling | doaj.art-c5dea3e50f95403bbe981c2f464cb5a8; 2023-12-03T12:11:57Z; eng; MDPI AG; Information; ISSN 2078-2489; 2021-02-01; vol. 12, no. 2, article 62; DOI 10.3390/info12020062; Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings; Eshete Derb Emiru, Shengwu Xiong, Yaxing Li, Awet Fesseha, Moussa Diallo (all: School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China); https://www.mdpi.com/2078-2489/12/2/62; Amharic; automatic speech recognition; connectionist temporal classification with attention; natural language processing; low resource language; out-of-vocabulary |
title | Improving Amharic Speech Recognition System Using Connectionist Temporal Classification with Attention Model and Phoneme-Based Byte-Pair-Encodings |
topic | Amharic automatic speech recognition connectionist temporal classification with attention natural language processing low resource language out-of-vocabulary |
url | https://www.mdpi.com/2078-2489/12/2/62 |
work_keys_str_mv | AT eshetederbemiru improvingamharicspeechrecognitionsystemusingconnectionisttemporalclassificationwithattentionmodelandphonemebasedbytepairencodings AT shengwuxiong improvingamharicspeechrecognitionsystemusingconnectionisttemporalclassificationwithattentionmodelandphonemebasedbytepairencodings AT yaxingli improvingamharicspeechrecognitionsystemusingconnectionisttemporalclassificationwithattentionmodelandphonemebasedbytepairencodings AT awetfesseha improvingamharicspeechrecognitionsystemusingconnectionisttemporalclassificationwithattentionmodelandphonemebasedbytepairencodings AT moussadiallo improvingamharicspeechrecognitionsystemusingconnectionisttemporalclassificationwithattentionmodelandphonemebasedbytepairencodings |
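
The description above mentions phoneme-based subwords generated by byte-pair encoding (BPE). As a rough illustration only, and not the authors' implementation, the following sketch learns BPE merges over words that have already been converted to phoneme sequences. The function names (`learn_bpe`, `merge_pair`, `segment`) and the toy phoneme strings are hypothetical; the toy data are not real Amharic transcriptions, and the G2P conversion and syllabification steps are assumed to have run beforehand.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from words given as phoneme sequences."""
    # Start from single phonemes; count how often each word occurs.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(phonemes, merges):
    """Apply learned merges, in order, to a new phoneme sequence."""
    symbols = tuple(phonemes)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Toy, made-up phoneme sequences (illustrative only).
toy_words = [("s", "a", "b"), ("s", "a", "b"), ("s", "a", "t"), ("b", "a")]
toy_merges = learn_bpe(toy_words, num_merges=2)
print(toy_merges)                                   # [('s', 'a'), ('sa', 'b')]
print(segment(("s", "a", "b", "a"), toy_merges))    # ('sab', 'a')
```

Because merges are learned over phonemes rather than graphemes, an unseen word can still be segmented into known subword units, which is the property the description relates to handling out-of-vocabulary words.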