A BiLSTM–Transformer and 2D CNN Architecture for Emotion Recognition from Speech

The significance of emotion recognition technology is continuing to grow, and research in this field enables artificial intelligence to accurately understand and react to human emotions. This study aims to enhance the efficacy of emotion recognition from speech by using dimensionality reduction algorithms for visualization, effectively outlining emotion-specific audio features.


Bibliographic Details
Main Authors:	Sera Kim, Seok-Pil Lee
Format:	Article
Language:	English
Published:	MDPI AG 2023-09-01
Series:	Electronics
Subjects:	emotion recognition from speech; transformer; attention mechanism; bidirectional LSTM; convolutional neural network; audio feature extraction
Online Access:	https://www.mdpi.com/2079-9292/12/19/4034

_version_	1797575993604440064
author	Sera Kim; Seok-Pil Lee
author_facet	Sera Kim; Seok-Pil Lee
author_sort	Sera Kim
collection	DOAJ
description	The significance of emotion recognition technology is continuing to grow, and research in this field enables artificial intelligence to accurately understand and react to human emotions. This study aims to enhance the efficacy of emotion recognition from speech by using dimensionality reduction algorithms for visualization, effectively outlining emotion-specific audio features. As a model for emotion recognition, we propose a new model architecture that combines the bidirectional long short-term memory (BiLSTM)–Transformer and a 2D convolutional neural network (CNN). The BiLSTM–Transformer processes audio features to capture the sequence of speech patterns, while the 2D CNN handles Mel-Spectrograms to capture the spatial details of audio. To validate the proficiency of the model, the 10-fold cross-validation method is used. The methodology proposed in this study was applied to Emo-DB and RAVDESS, two major emotion recognition from speech databases, and achieved high unweighted accuracy rates of 95.65% and 80.19%, respectively. These results indicate that the use of the proposed transformer-based deep learning model with appropriate feature selection can enhance performance in emotion recognition from speech.
first_indexed	2024-03-10T21:46:51Z
format	Article
id	doaj.art-9bb5e7c28b9a42a79783a7e59aee6b22
institution	Directory Open Access Journal
issn	2079-9292
language	English
last_indexed	2024-03-10T21:46:51Z
publishDate	2023-09-01
publisher	MDPI AG
record_format	Article
series	Electronics
spelling	doaj.art-9bb5e7c28b9a42a79783a7e59aee6b22 | 2023-11-19T14:16:18Z | eng | MDPI AG | Electronics | 2079-9292 | 2023-09-01 | 12 | 19 | 4034 | 10.3390/electronics12194034 | A BiLSTM–Transformer and 2D CNN Architecture for Emotion Recognition from Speech | Sera Kim 0 | Seok-Pil Lee 1 | Department of Computer Science, Graduate School, Sangmyung University, Seoul 03016, Republic of Korea | Department of Intelligent IoT, Sangmyung University, Seoul 03016, Republic of Korea | The significance of emotion recognition technology is continuing to grow, and research in this field enables artificial intelligence to accurately understand and react to human emotions. This study aims to enhance the efficacy of emotion recognition from speech by using dimensionality reduction algorithms for visualization, effectively outlining emotion-specific audio features. As a model for emotion recognition, we propose a new model architecture that combines the bidirectional long short-term memory (BiLSTM)–Transformer and a 2D convolutional neural network (CNN). The BiLSTM–Transformer processes audio features to capture the sequence of speech patterns, while the 2D CNN handles Mel-Spectrograms to capture the spatial details of audio. To validate the proficiency of the model, the 10-fold cross-validation method is used. The methodology proposed in this study was applied to Emo-DB and RAVDESS, two major emotion recognition from speech databases, and achieved high unweighted accuracy rates of 95.65% and 80.19%, respectively. These results indicate that the use of the proposed transformer-based deep learning model with appropriate feature selection can enhance performance in emotion recognition from speech. | https://www.mdpi.com/2079-9292/12/19/4034 | emotion recognition from speech; transformer; attention mechanism; bidirectional LSTM; convolutional neural network; audio feature extraction
spellingShingle	Sera Kim Seok-Pil Lee A BiLSTM–Transformer and 2D CNN Architecture for Emotion Recognition from Speech Electronics emotion recognition from speech transformer attention mechanism bidirectional LSTM convolutional neural network audio feature extraction
title	A BiLSTM–Transformer and 2D CNN Architecture for Emotion Recognition from Speech
title_full	A BiLSTM–Transformer and 2D CNN Architecture for Emotion Recognition from Speech
title_fullStr	A BiLSTM–Transformer and 2D CNN Architecture for Emotion Recognition from Speech
title_full_unstemmed	A BiLSTM–Transformer and 2D CNN Architecture for Emotion Recognition from Speech
title_short	A BiLSTM–Transformer and 2D CNN Architecture for Emotion Recognition from Speech
title_sort	bilstm transformer and 2d cnn architecture for emotion recognition from speech
topic	emotion recognition from speech; transformer; attention mechanism; bidirectional LSTM; convolutional neural network; audio feature extraction
url	https://www.mdpi.com/2079-9292/12/19/4034
work_keys_str_mv	AT serakim abilstmtransformerand2dcnnarchitectureforemotionrecognitionfromspeech AT seokpillee abilstmtransformerand2dcnnarchitectureforemotionrecognitionfromspeech AT serakim bilstmtransformerand2dcnnarchitectureforemotionrecognitionfromspeech AT seokpillee bilstmtransformerand2dcnnarchitectureforemotionrecognitionfromspeech


Similar Items