Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention

Emotions have a crucial function in the mental existence of humans. They are vital for identifying a person’s behaviour and mental condition. Speech Emotion Recognition (SER) is extracting a speaker’s emotional state from their speech signal. SER is a growing discipline in human–computer interaction...

Bibliographic Details
Main Author:	Waleed Alsabhan
Format:	Article
Language:	English
Published:	MDPI AG 2023-01-01
Series:	Sensors
Subjects:	human–computer Interaction; 1D and 2D Convolution Neural Networks (CNN); speech emotion recognition (SER); EMO-DB; SAVEE; ANAD
Online Access:	https://www.mdpi.com/1424-8220/23/3/1386

author	Waleed Alsabhan
collection	DOAJ
description	Emotions have a crucial function in the mental existence of humans. They are vital for identifying a person’s behaviour and mental condition. Speech Emotion Recognition (SER) is extracting a speaker’s emotional state from their speech signal. SER is a growing discipline in human–computer interaction, and it has recently attracted more significant interest. This is because there are not so many universal emotions; therefore, any intelligent system with enough computational capacity can educate itself to recognise them. However, the issue is that human speech is immensely diverse, making it difficult to create a single, standardised recipe for detecting hidden emotions. This work attempted to solve this research difficulty by combining a multilingual emotional dataset with building a more generalised and effective model for recognising human emotions. A two-step process was used to develop the model. The first stage involved the extraction of features, and the second stage involved the classification of the features that were extracted. ZCR, RMSE, and the renowned MFC coefficients were retrieved as features. Two proposed models, 1D CNN combined with LSTM and attention and a proprietary 2D CNN architecture, were used for classification. The outcomes demonstrated that the suggested 1D CNN with LSTM and attention performed better than the 2D CNN. For the EMO-DB, SAVEE, ANAD, and BAVED datasets, the model’s accuracy was 96.72%, 97.13%, 96.72%, and 88.39%, respectively. The model beat several earlier efforts on the same datasets, demonstrating the generality and efficacy of recognising multiple emotions from various languages.
first_indexed	2024-03-11T09:24:49Z
format	Article
id	doaj.art-18532e2b615a4518af2f9683c1594ef9
institution	Directory of Open Access Journals
issn	1424-8220
language	English
last_indexed	2024-03-11T09:24:49Z
publishDate	2023-01-01
publisher	MDPI AG
record_format	Article
series	Sensors
spelling	doaj.art-18532e2b615a4518af2f9683c1594ef9; 2023-11-16T18:00:21Z; eng; MDPI AG; Sensors; 1424-8220; 2023-01-01; vol. 23, no. 3, art. 1386; doi:10.3390/s23031386; Waleed Alsabhan (College of Engineering, Al Faisal University, P.O. Box 50927, Riyadh 11533, Saudi Arabia)
title	Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention
topic	human–computer Interaction; 1D and 2D Convolution Neural Networks (CNN); speech emotion recognition (SER); EMO-DB; SAVEE; ANAD
url	https://www.mdpi.com/1424-8220/23/3/1386

Similar Items