Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features


Bibliographic Details
Main Authors: Tursunov Anvarjon, Mustaqeem, Soonil Kwon
Format: Article
Language: English
Published: MDPI AG, 2020-09-01
Series: Sensors
Subjects: artificial intelligence; deep learning; deep frequency features extraction; speech emotion recognition; speech spectrograms
Online Access: https://www.mdpi.com/1424-8220/20/18/5212
collection DOAJ
description Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, speech emotion recognition (SER) systems evaluate the emotional state of a speaker by analyzing his or her speech signal. Emotion recognition is a challenging task for a machine, and making the machine smart enough to recognize emotions efficiently is equally challenging. The speech signal is hard to examine with signal-processing methods because it contains different frequencies and features that vary with emotions such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Although many algorithms have been developed for SER, their success rates remain low and vary with the language, the emotions, and the database. In this paper, we propose a new lightweight, effective SER model with low computational complexity and high recognition accuracy. The proposed method uses a convolutional neural network (CNN) to learn deep frequency features by applying a plain rectangular filter with a modified pooling strategy, which gives the features more discriminative power for SER. The proposed CNN model was trained on frequency features extracted from the speech data and was then tested to predict emotions. The proposed SER model was evaluated on two benchmark speech datasets, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Berlin Emotional Speech Database (EMO-DB), and achieved recognition accuracies of 77.01% and 92.02%, respectively. The experimental results demonstrate that the proposed CNN-based SER system achieves better recognition performance than state-of-the-art SER systems.
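The core idea in the abstract, a rectangular convolution filter with a pooling step applied to a speech spectrogram, can be illustrated with a short NumPy sketch. The filter shape (9 frequency bins × 1 time frame), the 2×2 pooling size, and the toy spectrogram dimensions below are illustrative assumptions, not the paper's actual hyperparameters:

```python
import numpy as np

def conv2d_valid(spec, kernel):
    """Valid (no-padding) 2D cross-correlation of a spectrogram with one kernel."""
    kh, kw = kernel.shape
    H, W = spec.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(spec[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, ph, pw):
    """Non-overlapping max pooling; trims edges that do not divide evenly."""
    H, W = x.shape
    H2, W2 = H // ph, W // pw
    x = x[:H2 * ph, :W2 * pw].reshape(H2, ph, W2, pw)
    return x.max(axis=(1, 3))

# Toy spectrogram: 128 frequency bins x 64 time frames (assumed sizes).
rng = np.random.default_rng(0)
spec = rng.standard_normal((128, 64))

# Hypothetical rectangular filter: 9 frequency bins x 1 time frame.
kernel = rng.standard_normal((9, 1))

feat = conv2d_valid(spec, kernel)             # feature map, shape (120, 64)
pooled = max_pool(np.maximum(feat, 0), 2, 2)  # ReLU then 2x2 pooling, shape (60, 32)
print(feat.shape, pooled.shape)
```

A tall, narrow kernel like this aggregates energy across neighboring frequency bins at a single time step, which is one plausible reading of the "plain rectangular filter" the abstract describes; the paper itself should be consulted for the exact filter sizes and the modified pooling strategy.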
id doaj.art-ce0971595d2b4cf3a936c57557e84284
institution Directory Open Access Journal
issn 1424-8220
affiliations Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea (all three authors)
doi 10.3390/s20185212
topic artificial intelligence
deep learning
deep frequency features extraction
speech emotion recognition
speech spectrograms