Time-Distributed Attention-Layered Convolution Neural Network with Ensemble Learning using Random Forest Classifier for Speech Emotion Recognition
Speech emotion recognition (SER) is the task of identifying human emotions from speech utterances, which combine linguistic and non-linguistic information. Non-linguistic SER provides a generalized solution for human–computer interaction applications because it overcomes...
Main Authors: | Yalamanchili Bhanusree, Samayamantula Srinivas Kumar, Anne Koteswara Rao |
---|---|
Format: | Article |
Language: | English |
Published: | UUM Press, 2023-01-01 |
Series: | Journal of ICT |
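Subjects: | Ensemble classifiers; Random Forest; Speech Emotion Recognition; Human Computer Interaction; time-distributed layers; spatiotemporal features |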
Online Access: | https://e-journal.uum.edu.my/index.php/jict/article/view/14982 |
_version_ | 1797948322216935424 |
---|---|
author | Yalamanchili Bhanusree; Samayamantula Srinivas Kumar; Anne Koteswara Rao |
author_facet | Yalamanchili Bhanusree; Samayamantula Srinivas Kumar; Anne Koteswara Rao |
author_sort | Yalamanchili Bhanusree |
collection | DOAJ |
description |
Speech emotion recognition (SER) is the task of identifying human emotions from speech utterances, which combine linguistic and non-linguistic information. Non-linguistic SER provides a generalized solution for human–computer interaction applications because it overcomes the language barrier. Machine learning and deep learning techniques have previously been proposed for classifying emotions using hand-picked features. To achieve effective and generalized SER, feature extraction can be performed with deep neural networks and classification with ensemble learning. The proposed model employs a time-distributed attention-layered convolution neural network (TDACNN) to extract spatiotemporal features in the first stage and a random forest (RF) classifier, an ensemble classifier, for efficient and generalized classification of emotions in the second stage. The model was evaluated on the RAVDESS and IEMOCAP data corpora and compared with the CNN-SVM and CNN-RF models for SER. The TDACNN-RF model achieved test classification accuracies of 92.19 percent and 90.27 percent on the RAVDESS and IEMOCAP data corpora, respectively. The experimental results indicate that the proposed model extracts spatiotemporal features from time-series speech signals efficiently and classifies emotions with good accuracy. Class confusion among the emotions was reduced for both data corpora, suggesting that the model generalizes.
|
first_indexed | 2024-04-10T21:41:32Z |
format | Article |
id | doaj.art-7fd2819fb6764ee28aa451adb4b53a13 |
institution | Directory of Open Access Journals |
issn | 1675-414X; 2180-3862 |
language | English |
last_indexed | 2024-04-10T21:41:32Z |
publishDate | 2023-01-01 |
publisher | UUM Press |
record_format | Article |
series | Journal of ICT |
spelling | doaj.art-7fd2819fb6764ee28aa451adb4b53a13; 2023-01-19T01:50:45Z; eng; UUM Press; Journal of ICT; 1675-414X; 2180-3862; 2023-01-01; 22; 1; 10.32890/jict2023.22.1.3; Time-Distributed Attention-Layered Convolution Neural Network with Ensemble Learning using Random Forest Classifier for Speech Emotion Recognition; Yalamanchili Bhanusree (0); Samayamantula Srinivas Kumar (1); Anne Koteswara Rao (2); (0) Department of Computer Science Engineering, Vallurupalli Nageswara Rao Vignana Jyothi Institute of Engineering and Technology, India; (1) Department of Electronics and Communications Engineering, Jawaharlal Nehru Technological University Kakinada, India; (2) Department of Computer Science Engineering, Kalasalingam Academy of Research and Education, India; Speech emotion recognition (SER) is the task of identifying human emotions from speech utterances, which combine linguistic and non-linguistic information. Non-linguistic SER provides a generalized solution for human–computer interaction applications because it overcomes the language barrier. Machine learning and deep learning techniques have previously been proposed for classifying emotions using hand-picked features. To achieve effective and generalized SER, feature extraction can be performed with deep neural networks and classification with ensemble learning. The proposed model employs a time-distributed attention-layered convolution neural network (TDACNN) to extract spatiotemporal features in the first stage and a random forest (RF) classifier, an ensemble classifier, for efficient and generalized classification of emotions in the second stage. The model was evaluated on the RAVDESS and IEMOCAP data corpora and compared with the CNN-SVM and CNN-RF models for SER. The TDACNN-RF model achieved test classification accuracies of 92.19 percent and 90.27 percent on the RAVDESS and IEMOCAP data corpora, respectively. The experimental results indicate that the proposed model extracts spatiotemporal features from time-series speech signals efficiently and classifies emotions with good accuracy. Class confusion among the emotions was reduced for both data corpora, suggesting that the model generalizes.; https://e-journal.uum.edu.my/index.php/jict/article/view/14982; Ensemble classifiers; Random Forest; Speech Emotion Recognition; Human Computer Interaction; time-distributed layers; spatiotemporal features |
spellingShingle | Yalamanchili Bhanusree; Samayamantula Srinivas Kumar; Anne Koteswara Rao; Time-Distributed Attention-Layered Convolution Neural Network with Ensemble Learning using Random Forest Classifier for Speech Emotion Recognition; Journal of ICT; Ensemble classifiers; Random Forest; Speech Emotion Recognition; Human Computer Interaction; time-distributed layers; spatiotemporal features |
title | Time-Distributed Attention-Layered Convolution Neural Network with Ensemble Learning using Random Forest Classifier for Speech Emotion Recognition |
topic | Ensemble classifiers; Random Forest; Speech Emotion Recognition; Human Computer Interaction; time-distributed layers; spatiotemporal features |
url | https://e-journal.uum.edu.my/index.php/jict/article/view/14982 |
work_keys_str_mv | AT yalamanchilibhanusree timedistributedattentionlayeredconvolutionneuralnetworkwithensemblelearningusingrandomforestclassifierforspeechemotionrecognition AT samayamantulasrinivaskumar timedistributedattentionlayeredconvolutionneuralnetworkwithensemblelearningusingrandomforestclassifierforspeechemotionrecognition AT annekoteswararao timedistributedattentionlayeredconvolutionneuralnetworkwithensemblelearningusingrandomforestclassifierforspeechemotionrecognition |
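
The two-stage design summarized in the description (a time-distributed feature extractor with attention, followed by an ensemble classifier) can be illustrated with a minimal standard-library Python sketch. This is not the authors' TDACNN-RF implementation: the per-frame extractor (energy and zero-crossing rate) is a hypothetical stand-in for the convolutional layers, the attention query vector is fixed by hand rather than learned, and the three threshold rules with a majority vote only mimic the voting step of a random forest. All function names, labels, and thresholds are assumptions made for illustration.

```python
import math
from collections import Counter


def frame_features(frame):
    """Toy per-frame extractor (hypothetical stand-in for the CNN):
    returns [mean energy, zero-crossing rate]."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, crossings / (len(frame) - 1)]


def time_distributed(extractor, frames):
    """Apply the same extractor, unchanged, to every time step."""
    return [extractor(frame) for frame in frames]


def attention_pool(step_features, query):
    """Softmax attention over time steps: score each step against the
    query, normalize the scores, and return the weighted sum."""
    scores = [sum(q * v for q, v in zip(query, feat)) for feat in step_features]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(step_features[0])
    pooled = [
        sum(w * feat[i] for w, feat in zip(weights, step_features))
        for i in range(dim)
    ]
    return pooled, weights


def majority_vote(classifiers, features):
    """Ensemble decision: every classifier votes, the most common label wins."""
    votes = [clf(features) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Synthetic signal (assumption): loud for the first half, quiet afterwards.
signal = [math.sin(0.3 * n) * (1.0 if n < 40 else 0.2) for n in range(80)]
frames = [signal[i:i + 20] for i in range(0, 80, 20)]

steps = time_distributed(frame_features, frames)
pooled, weights = attention_pool(steps, query=[1.0, 0.0])

# Hypothetical threshold rules standing in for the trees of a random forest.
stumps = [
    lambda f: "high_arousal" if f[0] > 0.1 else "low_arousal",
    lambda f: "high_arousal" if f[1] < 0.2 else "low_arousal",
    lambda f: "high_arousal" if f[0] > 0.05 else "low_arousal",
]
label = majority_vote(stumps, pooled)
print(label, [round(w, 3) for w in weights])
```

With the query set to [1.0, 0.0], the attention scores equal the frame energies, so louder frames contribute more to the pooled feature vector. In a trained model the query and the extractor would be learned, and the voters would be decision trees fitted on bootstrap samples.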