Multi-branch feature learning based speech emotion recognition using SCAR-NET

Speech emotion recognition (SER) is an active research area in affective computing. Recognizing emotions from speech signals helps to assess human behaviour, which has promising applications in the area of human-computer interaction. The performance of deep learning-based SER methods relies heavily...

Bibliographic Details
Main Authors: Keji Mao, Yuxiang Wang, Ligang Ren, Jinhong Zhang, Jiefan Qiu, Guanglin Dai
Format: Article
Language: English
Published: Taylor & Francis Group 2023-12-01
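Series: Connection Science
Subjects: affective computing; speech emotion recognition; convolutional neural network; parallel paths; feature learning
Online Access: http://dx.doi.org/10.1080/09540091.2023.2189217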
_version_ 1797684015412543488
author Keji Mao
Yuxiang Wang
Ligang Ren
Jinhong Zhang
Jiefan Qiu
Guanglin Dai
author_sort Keji Mao
collection DOAJ
description Speech emotion recognition (SER) is an active research area in affective computing. Recognizing emotions from speech signals helps to assess human behaviour, with promising applications in human-computer interaction. The performance of deep learning-based SER methods relies heavily on feature learning. In this paper, we propose SCAR-NET, an improved convolutional neural network, to extract emotional features from speech signals and classify them. The work has two main parts: first, we extract spectral, temporal, and spectral-temporal correlation features through three parallel paths; second, we design split-convolve-aggregate residual blocks for multi-branch deep feature learning. The features are refined by global average pooling (GAP) and passed through a softmax classifier to generate predictions for different emotions. We also conduct a series of experiments to evaluate the robustness and effectiveness of SCAR-NET, which achieves 96.45%, 83.13%, and 89.93% accuracy on the speech emotion datasets EMO-DB, SAVEE, and RAVDESS, respectively. These results demonstrate the superior performance of SCAR-NET.
first_indexed 2024-03-12T00:24:29Z
format Article
id doaj.art-98092f9487a74ff8a2072b5677cf3603
institution Directory Open Access Journal
issn 0954-0091
1360-0494
language English
last_indexed 2024-03-12T00:24:29Z
publishDate 2023-12-01
publisher Taylor & Francis Group
record_format Article
series Connection Science
spelling doaj.art-98092f9487a74ff8a2072b5677cf3603 | 2023-09-15T10:48:01Z | eng | Taylor & Francis Group | Connection Science | 0954-0091, 1360-0494 | 2023-12-01 | 35(1) | 10.1080/09540091.2023.2189217 | 2189217 | Multi-branch feature learning based speech emotion recognition using SCAR-NET | Keji Mao, Yuxiang Wang, Ligang Ren, Jinhong Zhang, Jiefan Qiu, Guanglin Dai (all Zhejiang University of Technology) | http://dx.doi.org/10.1080/09540091.2023.2189217 | affective computing; speech emotion recognition; convolutional neural network; parallel paths; feature learning
title Multi-branch feature learning based speech emotion recognition using SCAR-NET
topic affective computing
speech emotion recognition
convolutional neural network
parallel paths
feature learning
url http://dx.doi.org/10.1080/09540091.2023.2189217