Performance Improvement of Speech Emotion Recognition Systems by Combining 1D CNN and LSTM with Data Augmentation


Bibliographic Details
Main Authors: Shing-Tai Pan, Han-Jui Wu
Format: Article
Language: English
Published: MDPI AG, 2023-05-01
Series: Electronics
Subjects: speech emotion recognition; one-dimensional neural network; LSTM; CNN; MFCCs
Online Access: https://www.mdpi.com/2079-9292/12/11/2436
Description: In recent years, the increasing popularity of smart mobile devices has made the interaction between devices and users, particularly through voice interaction, more crucial. By enabling smart devices to better understand users’ emotional states through voice data, it becomes possible to provide more personalized services. This paper proposes a novel machine learning model for speech emotion recognition called CLDNN, which combines convolutional neural networks (CNN), long short-term memory neural networks (LSTM), and deep neural networks (DNN). To design a system that closely resembles the human auditory system in recognizing audio signals, this article uses the Mel-frequency cepstral coefficients (MFCCs) of audio data as the input of the machine learning model. First, the MFCCs of the voice signal are extracted as the input of the model. Local feature learning blocks (LFLBs) composed of one-dimensional CNNs are employed to calculate the feature values of the data. As audio signals are time-series data, the resulting feature values from LFLBs are then fed into the LSTM layer to enhance learning on the time-series level. Finally, fully connected layers are used for classification and prediction. The experimental evaluation of the proposed model utilizes three databases: RAVDESS, EMO-DB, and IEMOCAP. The results demonstrate that the LSTM model effectively models the features extracted from the 1D CNN due to the time-series characteristics of speech signals. Additionally, the data augmentation method applied in this paper proves beneficial in improving the recognition accuracy and stability of the systems for different databases. Furthermore, according to the experimental results, the proposed system achieves superior recognition rates compared to related research in speech emotion recognition.
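The description outlines a concrete data flow: MFCC input, 1D-CNN local feature learning blocks (LFLBs), an LSTM over the resulting time series, then fully connected layers for classification. The NumPy sketch below illustrates only that flow of tensor shapes; the kernel widths, channel counts, hidden size, and 8-class output are placeholders, not the configuration reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def lflb(x, kernel, pool=2):
    """Local feature learning block: 1D convolution + ReLU + max pooling.
    x: (time, channels_in), kernel: (width, channels_in, channels_out)."""
    width, c_in, c_out = kernel.shape
    t_out = x.shape[0] - width + 1
    conv = np.empty((t_out, c_out))
    for t in range(t_out):
        conv[t] = np.tensordot(x[t:t + width], kernel, axes=([0, 1], [0, 1]))
    conv = np.maximum(conv, 0.0)                      # ReLU
    t_pool = conv.shape[0] // pool
    return conv[:t_pool * pool].reshape(t_pool, pool, c_out).max(axis=1)

def lstm(x, h_dim):
    """Minimal LSTM over time; returns the final hidden state."""
    d = x.shape[1]
    W = rng.normal(0, 0.1, (4, h_dim, d + h_dim))     # gate weights: i, f, g, o
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x_t in x:
        z = np.concatenate([x_t, h])
        i, f, o = sigmoid(W[0] @ z), sigmoid(W[1] @ z), sigmoid(W[3] @ z)
        g = np.tanh(W[2] @ z)
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

# Toy input: 200 MFCC frames with 13 coefficients each
mfccs = rng.normal(size=(200, 13))
feats = lflb(mfccs, rng.normal(0, 0.1, (5, 13, 32)))   # first LFLB
feats = lflb(feats, rng.normal(0, 0.1, (5, 32, 64)))   # second LFLB
h = lstm(feats, 128)                                   # temporal modelling
W_out = rng.normal(0, 0.1, (8, 128))                   # 8 emotion classes (illustrative)
logits = W_out @ h
probs = np.exp(logits - logits.max()); probs /= probs.sum()   # softmax
print(probs.shape)  # (8,)
```

In practice such a model would be built and trained end to end in a deep learning framework; the sketch only shows how shapes move through LFLB, LSTM, and softmax stages.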
ISSN: 2079-9292
DOI: 10.3390/electronics12112436 (Electronics, vol. 12, no. 11, article 2436, 2023-05-01)
Affiliations: Shing-Tai Pan and Han-Jui Wu, Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan
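The description names Mel-frequency cepstral coefficients (MFCCs) as the model input. A simplified, self-contained version of that front end is sketched below; the frame length, hop size, filterbank size, and coefficient count are common defaults, not values taken from the paper.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_coeffs=13):
    """Simplified MFCC pipeline: frame -> Hamming window -> power spectrum
    -> triangular mel filterbank -> log -> DCT-II. Educational only."""
    # 1. Slice the waveform into overlapping frames and window them
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)
    # 2. Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Triangular mel filterbank, spaced evenly on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_inv(np.linspace(0, mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 4. DCT-II decorrelates the log filterbank energies
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T

tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz test tone
coeffs = mfcc(tone)
print(coeffs.shape)  # (n_frames, 13)
```

Production systems typically use a library implementation (e.g. librosa) with pre-emphasis, liftering, and delta features; the sketch keeps only the core steps.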
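The title credits data augmentation for part of the accuracy improvement, but this record does not say which augmentation technique the paper used. As a purely illustrative example, one common waveform-level augmentation for speech emotion recognition is additive Gaussian noise at a fixed signal-to-noise ratio:

```python
import numpy as np

rng = np.random.default_rng(1)

def add_noise(signal, snr_db=20):
    """Return a copy of the waveform with white Gaussian noise added
    at the requested signal-to-noise ratio (in dB)."""
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return signal + rng.normal(0, np.sqrt(noise_power), signal.shape)

clean = np.sin(2 * np.pi * 300 * np.arange(8000) / 8000)  # toy 1 s signal
noisy = add_noise(clean)   # same length, perturbed samples
```

Each augmented copy is fed through the same MFCC front end as the original, effectively enlarging the training set; pitch shifting and time stretching are other augmentations commonly used with these corpora.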