Performance Improvement of Speech Emotion Recognition Systems by Combining 1D CNN and LSTM with Data Augmentation


Bibliographic Details
Main Authors: Shing-Tai Pan, Han-Jui Wu
Format: Article
Language: English
Published: MDPI AG, 2023-05-01
Series: Electronics
Subjects: speech emotion recognition; one-dimensional neural network; LSTM; CNN; MFCCs
Online Access: https://www.mdpi.com/2079-9292/12/11/2436
Description: In recent years, the increasing popularity of smart mobile devices has made the interaction between devices and users, particularly through voice interaction, more crucial. By enabling smart devices to better understand users’ emotional states through voice data, it becomes possible to provide more personalized services. This paper proposes a novel machine learning model for speech emotion recognition called CLDNN, which combines convolutional neural networks (CNN), long short-term memory neural networks (LSTM), and deep neural networks (DNN). To design a system that closely resembles the human auditory system in recognizing audio signals, this article uses the Mel-frequency cepstral coefficients (MFCCs) of audio data as the input of the machine learning model. First, the MFCCs of the voice signal are extracted as the input of the model. Local feature learning blocks (LFLBs) composed of one-dimensional CNNs are employed to calculate the feature values of the data. As audio signals are time-series data, the resulting feature values from LFLBs are then fed into the LSTM layer to enhance learning on the time-series level. Finally, fully connected layers are used for classification and prediction. The experimental evaluation of the proposed model utilizes three databases: RAVDESS, EMO-DB, and IEMOCAP. The results demonstrate that the LSTM model effectively models the features extracted from the 1D CNN due to the time-series characteristics of speech signals. Additionally, the data augmentation method applied in this paper proves beneficial in improving the recognition accuracy and stability of the systems for different databases. Furthermore, according to the experimental results, the proposed system achieves superior recognition rates compared to related research in speech emotion recognition.
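The description outlines a concrete data flow: MFCC input, 1D-CNN local feature learning blocks (LFLBs), an LSTM over the resulting time series, then fully connected layers for classification. The NumPy sketch below illustrates only that flow of tensor shapes; the kernel widths, channel counts, hidden size, and 8-class output are placeholders, not the configuration reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def lflb(x, kernel, pool=2):
    """Local feature learning block: 1D convolution + ReLU + max pooling.
    x: (time, channels_in), kernel: (width, channels_in, channels_out)."""
    width, c_in, c_out = kernel.shape
    t_out = x.shape[0] - width + 1
    conv = np.empty((t_out, c_out))
    for t in range(t_out):
        conv[t] = np.tensordot(x[t:t + width], kernel, axes=([0, 1], [0, 1]))
    conv = np.maximum(conv, 0.0)                      # ReLU
    t_pool = conv.shape[0] // pool
    return conv[:t_pool * pool].reshape(t_pool, pool, c_out).max(axis=1)

def lstm(x, h_dim):
    """Minimal LSTM over time; returns the final hidden state."""
    d = x.shape[1]
    W = rng.normal(0, 0.1, (4, h_dim, d + h_dim))     # gate weights: i, f, g, o
    h, c = np.zeros(h_dim), np.zeros(h_dim)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x_t in x:
        z = np.concatenate([x_t, h])
        i, f, o = sigmoid(W[0] @ z), sigmoid(W[1] @ z), sigmoid(W[3] @ z)
        g = np.tanh(W[2] @ z)
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

# Toy input: 200 MFCC frames with 13 coefficients each
mfccs = rng.normal(size=(200, 13))
feats = lflb(mfccs, rng.normal(0, 0.1, (5, 13, 32)))   # first LFLB
feats = lflb(feats, rng.normal(0, 0.1, (5, 32, 64)))   # second LFLB
h = lstm(feats, 128)                                   # temporal modelling
W_out = rng.normal(0, 0.1, (8, 128))                   # 8 emotion classes (illustrative)
logits = W_out @ h
probs = np.exp(logits - logits.max()); probs /= probs.sum()   # softmax
print(probs.shape)  # (8,)
```

In practice such a model would be built and trained end to end in a deep learning framework; the sketch only shows how shapes move through LFLB, LSTM, and softmax stages.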
ISSN: 2079-9292
DOI: 10.3390/electronics12112436 (Electronics, vol. 12, no. 11, article 2436, 2023-05-01)
Affiliations: Shing-Tai Pan and Han-Jui Wu, Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan
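The description names Mel-frequency cepstral coefficients (MFCCs) as the model input. A simplified, self-contained version of that front end is sketched below; the frame length, hop size, filterbank size, and coefficient count are common defaults, not values taken from the paper.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_coeffs=13):
    """Simplified MFCC pipeline: frame -> Hamming window -> power spectrum
    -> triangular mel filterbank -> log -> DCT-II. Educational only."""
    # 1. Slice the waveform into overlapping frames and window them
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)
    # 2. Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Triangular mel filterbank, spaced evenly on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_inv(np.linspace(0, mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 4. DCT-II decorrelates the log filterbank energies
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T

tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz test tone
coeffs = mfcc(tone)
print(coeffs.shape)  # (n_frames, 13)
```

Production systems typically use a library implementation (e.g. librosa) with pre-emphasis, liftering, and delta features; the sketch keeps only the core steps.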
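The title credits data augmentation for part of the accuracy improvement, but this record does not say which augmentation technique the paper used. As a purely illustrative example, one common waveform-level augmentation for speech emotion recognition is additive Gaussian noise at a fixed signal-to-noise ratio:

```python
import numpy as np

rng = np.random.default_rng(1)

def add_noise(signal, snr_db=20):
    """Return a copy of the waveform with white Gaussian noise added
    at the requested signal-to-noise ratio (in dB)."""
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return signal + rng.normal(0, np.sqrt(noise_power), signal.shape)

clean = np.sin(2 * np.pi * 300 * np.arange(8000) / 8000)  # toy 1 s signal
noisy = add_noise(clean)   # same length, perturbed samples
```

Each augmented copy is fed through the same MFCC front end as the original, effectively enlarging the training set; pitch shifting and time stretching are other augmentations commonly used with these corpora.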