Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech Recognition

Deep neural networks (DNNs) have shown a great achievement in acoustic modeling for speech recognition task. Of these networks, convolutional neural network (CNN) is an effective network for representing the local properties of the speech formants. However, CNN is not suitable for modeling the long-...

Full description

Bibliographic Details
Main Authors: Tessfu Geteye Fantaye, Junqing Yu, Tulu Tilahun Hailu
Format: Article
Language:English
Published: MDPI AG 2020-05-01
Series:Computers
Subjects:
Online Access:https://www.mdpi.com/2073-431X/9/2/36
_version_ 1797569085929684992
author Tessfu Geteye Fantaye
Junqing Yu
Tulu Tilahun Hailu
author_facet Tessfu Geteye Fantaye
Junqing Yu
Tulu Tilahun Hailu
author_sort Tessfu Geteye Fantaye
collection DOAJ
description Deep neural networks (DNNs) have shown a great achievement in acoustic modeling for speech recognition task. Of these networks, convolutional neural network (CNN) is an effective network for representing the local properties of the speech formants. However, CNN is not suitable for modeling the long-term context dependencies between speech signal frames. Recently, the recurrent neural networks (RNNs) have shown great abilities for modeling long-term context dependencies. However, the performance of RNNs is not good for low-resource speech recognition tasks, and is even worse than the conventional feed-forward neural networks. Moreover, these networks often overfit severely on the training corpus in the low-resource speech recognition tasks. This paper presents the results of our contributions to combine CNN and conventional RNN with gate, highway, and residual networks to reduce the above problems. The optimal neural network structures and training strategies for the proposed neural network models are explored. Experiments were conducted on the Amharic and Chaha datasets, as well as on the limited language packages (10-h) of the benchmark datasets released under the Intelligence Advanced Research Projects Activity (IARPA) Babel Program. The proposed neural network models achieve 0.1–42.79% relative performance improvements over their corresponding feed-forward DNN, CNN, bidirectional RNN (BRNN), or bidirectional gated recurrent unit (BGRU) baselines across six language collections. These approaches are promising candidates for developing better performance acoustic models for low-resource speech recognition tasks.
first_indexed 2024-03-10T20:06:00Z
format Article
id doaj.art-e89bf199a8e648edbb1e37bd621d6bd9
institution Directory Open Access Journal
issn 2073-431X
language English
last_indexed 2024-03-10T20:06:00Z
publishDate 2020-05-01
publisher MDPI AG
record_format Article
series Computers
spelling doaj.art-e89bf199a8e648edbb1e37bd621d6bd92023-11-19T23:18:49ZengMDPI AGComputers2073-431X2020-05-01923610.3390/computers9020036Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech RecognitionTessfu Geteye Fantaye0Junqing Yu1Tulu Tilahun Hailu2School of Computer Science & Technology, Huazhong University of Science & Technology, Wuhan 430074, ChinaSchool of Computer Science & Technology, Huazhong University of Science & Technology, Wuhan 430074, ChinaSchool of Computer Science & Technology, Huazhong University of Science & Technology, Wuhan 430074, ChinaDeep neural networks (DNNs) have shown a great achievement in acoustic modeling for speech recognition task. Of these networks, convolutional neural network (CNN) is an effective network for representing the local properties of the speech formants. However, CNN is not suitable for modeling the long-term context dependencies between speech signal frames. Recently, the recurrent neural networks (RNNs) have shown great abilities for modeling long-term context dependencies. However, the performance of RNNs is not good for low-resource speech recognition tasks, and is even worse than the conventional feed-forward neural networks. Moreover, these networks often overfit severely on the training corpus in the low-resource speech recognition tasks. This paper presents the results of our contributions to combine CNN and conventional RNN with gate, highway, and residual networks to reduce the above problems. The optimal neural network structures and training strategies for the proposed neural network models are explored. Experiments were conducted on the Amharic and Chaha datasets, as well as on the limited language packages (10-h) of the benchmark datasets released under the Intelligence Advanced Research Projects Activity (IARPA) Babel Program. The proposed neural network models achieve 0.1–42.79% relative performance improvements over their corresponding feed-forward DNN, CNN, bidirectional RNN (BRNN), or bidirectional gated recurrent unit (BGRU) baselines across six language collections. These approaches are promising candidates for developing better performance acoustic models for low-resource speech recognition tasks.https://www.mdpi.com/2073-431X/9/2/36speech recognitionlow-resource languagesacoustic modelsneural network models
spellingShingle Tessfu Geteye Fantaye
Junqing Yu
Tulu Tilahun Hailu
Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech Recognition
Computers
speech recognition
low-resource languages
acoustic models
neural network models
title Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech Recognition
title_full Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech Recognition
title_fullStr Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech Recognition
title_full_unstemmed Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech Recognition
title_short Advanced Convolutional Neural Network-Based Hybrid Acoustic Models for Low-Resource Speech Recognition
title_sort advanced convolutional neural network based hybrid acoustic models for low resource speech recognition
topic speech recognition
low-resource languages
acoustic models
neural network models
url https://www.mdpi.com/2073-431X/9/2/36
work_keys_str_mv AT tessfugeteyefantaye advancedconvolutionalneuralnetworkbasedhybridacousticmodelsforlowresourcespeechrecognition
AT junqingyu advancedconvolutionalneuralnetworkbasedhybridacousticmodelsforlowresourcespeechrecognition
AT tulutilahunhailu advancedconvolutionalneuralnetworkbasedhybridacousticmodelsforlowresourcespeechrecognition