Speech Emotion Recognition Based on Two-Stream Deep Learning Model Using Korean Audio Information

Bibliographic Details
Main Authors: A-Hyeon Jo (Electronic Engineering IT-Bio Convergence System Major, Chosun University, Gwangju 61452, Republic of Korea); Keun-Chang Kwak (Electronic Engineering, Chosun University, Gwangju 61452, Republic of Korea)
Format: Article
Language: English
Published: MDPI AG, 2023-02-01
Series: Applied Sciences, vol. 13, no. 4, art. 2167
ISSN: 2076-3417
DOI: 10.3390/app13042167
Subjects: speech emotion recognition; human–computer interaction; two-stream; bidirectional long short-term memory; convolutional neural network
Online Access: https://www.mdpi.com/2076-3417/13/4/2167

Abstract: Identifying a person’s emotions is an important element of communication, and the voice is a means of expressing emotions easily and naturally. Speech emotion recognition technology is therefore a crucial component of human–computer interaction (HCI), in which accurately identifying emotions is key. This study presents a two-stream emotion recognition model that combines bidirectional long short-term memory (Bi-LSTM) and convolutional neural networks (CNNs), evaluated on a Korean speech emotion database, and comparatively analyzes its performance. The experimental data were obtained from the Korean speech emotion recognition database built by Chosun University. Two deep learning models, Bi-LSTM and YAMNet (a CNN-based transfer learning model), were connected in a two-stream architecture to design the emotion recognition model, and various speech feature extraction methods and deep learning models were compared in terms of performance. The speech emotion recognition accuracy of Bi-LSTM and YAMNet alone was 90.38% and 94.91%, respectively, whereas the two-stream model reached 96%, an improvement of between 1.09 and 5.62 percentage points over the single models.
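
The record gives only the high-level architecture, so the following is a minimal sketch of a two-stream late-fusion model of the kind the abstract describes, assuming a TensorFlow/Keras implementation. The input shapes, layer sizes, fusion method, and number of emotion classes are illustrative assumptions, not the authors' reported configuration.

```python
# Sketch of a two-stream speech emotion classifier: one stream is a
# Bi-LSTM over frame-level speech features, the other consumes a
# pretrained-CNN (e.g., YAMNet) embedding; the streams are fused by
# concatenation before classification. Shapes and sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_EMOTIONS = 4  # assumption: class count of the emotion database

# Stream 1: Bi-LSTM over a variable-length sequence of frame-level
# features (e.g., 39-dim MFCCs with deltas; dimension is illustrative).
seq_in = layers.Input(shape=(None, 39), name="frame_features")
x1 = layers.Bidirectional(layers.LSTM(128))(seq_in)

# Stream 2: a 1024-dim utterance-level embedding from a pretrained CNN
# such as YAMNet (averaged over its audio patches upstream).
emb_in = layers.Input(shape=(1024,), name="yamnet_embedding")
x2 = layers.Dense(128, activation="relu")(emb_in)

# Late fusion: concatenate the two streams, then classify.
fused = layers.concatenate([x1, x2])
fused = layers.Dense(64, activation="relu")(fused)
out = layers.Dense(NUM_EMOTIONS, activation="softmax")(fused)

model = models.Model(inputs=[seq_in, emb_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

For the second stream, YAMNet embeddings can be obtained from the TensorFlow Hub model at https://tfhub.dev/google/yamnet/1, which, given a 16 kHz mono waveform, returns per-patch 1024-dimensional embeddings alongside its class scores; whether the paper fuses features this way or fine-tunes YAMNet end to end is not specified in this record.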