Time-Distributed Attention-Layered Convolution Neural Network with Ensemble Learning using Random Forest Classifier for Speech Emotion Recognition

Speech Emotion Recognition (SER) is the task of identifying human emotions from speech utterances. Speech utterances combine linguistic and non-linguistic information. Non-linguistic SER provides a generalized solution for human–computer interaction applications because it overcomes...


Bibliographic Details
Main Authors: Yalamanchili Bhanusree, Samayamantula Srinivas Kumar, Anne Koteswara Rao
Format: Article
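Language: English
Published: UUM Press 2023-01-01
Series: Journal of ICT
Subjects: Ensemble classifiers; Random Forest; Speech Emotion Recognition; Human Computer Interaction; time-distributed layers; spatiotemporal features
Online Access: https://e-journal.uum.edu.my/index.php/jict/article/view/14982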
author Yalamanchili Bhanusree
Samayamantula Srinivas Kumar
Anne Koteswara Rao
collection DOAJ
description Speech Emotion Recognition (SER) is the task of identifying human emotions from speech utterances. Speech utterances combine linguistic and non-linguistic information. Non-linguistic SER provides a generalized solution for human–computer interaction applications because it overcomes the language barrier. Earlier machine learning and deep learning approaches classified emotions using handpicked features. To achieve effective and generalized SER, feature extraction can instead be performed by deep neural networks, with ensemble learning used for classification. The proposed model employs a time-distributed attention-layered convolution neural network (TDACNN) to extract spatiotemporal features in the first stage and a random forest (RF) classifier, an ensemble classifier, for efficient and generalized classification of emotions in the second stage. The model was evaluated on the RAVDESS and IEMOCAP data corpora and compared with the CNN-SVM and CNN-RF models for SER. The TDACNN-RF model achieved test classification accuracies of 92.19 percent and 90.27 percent on the RAVDESS and IEMOCAP data corpora, respectively. The experimental results indicate that the proposed model extracts spatiotemporal features from time-series speech signals efficiently and classifies emotions with good accuracy. Class confusion among the emotions was reduced for both data corpora, suggesting that the model generalizes.
first_indexed 2024-04-10T21:41:32Z
id doaj.art-7fd2819fb6764ee28aa451adb4b53a13
institution Directory Open Access Journal
issn 1675-414X
2180-3862
last_indexed 2024-04-10T21:41:32Z
doi 10.32890/jict2023.22.1.3
volume 22
issue 1
affiliation Yalamanchili Bhanusree: Department of Computer Science Engineering, Vallurupalli Nageswara Rao Vignana Jyothi Institute of Engineering and Technology, India
Samayamantula Srinivas Kumar: Department of Electronics and Communications Engineering, Jawaharlal Nehru Technological University Kakinada, India
Anne Koteswara Rao: Department of Computer Science Engineering, Kalasalingam Academy of Research and Education, India
title Time-Distributed Attention-Layered Convolution Neural Network with Ensemble Learning using Random Forest Classifier for Speech Emotion Recognition
topic Ensemble classifiers
Random Forest
Speech Emotion Recognition
Human Computer Interaction
time-distributed layers
spatiotemporal features
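
The abstract describes a first stage that extracts features per time step and weights them with attention before a second-stage classifier. The sketch below illustrates only that general idea in plain Python and is not the authors' TDACNN: `frame_features` is a hypothetical stand-in for the per-segment CNN, `query` is a made-up attention vector, and the sample values are invented for illustration.

```python
import math

def frame_features(segment):
    """Hypothetical stand-in for a per-segment CNN: [mean energy, zero-crossing rate]."""
    energy = sum(x * x for x in segment) / len(segment)
    crossings = sum(1 for a, b in zip(segment, segment[1:]) if (a < 0) != (b < 0))
    return [energy, crossings / (len(segment) - 1)]

def time_distributed(fn, segments):
    """Apply the same function (shared parameters) to every time segment."""
    return [fn(seg) for seg in segments]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, query):
    """Weight each segment by softmax(feature . query) and return the weighted sum."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]
    return pooled, weights

# Toy "utterance": three segments of four samples each (invented numbers).
segments = [
    [0.1, -0.1, 0.1, -0.1],  # quiet, oscillating
    [0.9, 0.8, -0.9, 0.7],   # loud
    [0.0, 0.0, 0.1, 0.0],    # near silence
]
query = [2.0, 0.0]  # assumed attention vector that favours high-energy segments

features = time_distributed(frame_features, segments)
pooled, weights = attention_pool(features, query)
print("attention weights:", weights)
print("pooled feature vector:", pooled)
```

In a two-stage design like the one the abstract outlines, a fixed-length vector such as `pooled` is what the second-stage classifier (a random forest in the paper) would receive for each utterance; in a real system the extractor and attention vector would be learned rather than hand-written.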