A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition
Speech Emotion Recognition (SER), which aims to analyze the emotion expressed in speech, has long been an important topic in speech acoustics. In recent years, the application of deep-learning methods has made great progress in SER. However, the small scale of the emotional...
Main Authors: | Zhongwen Tu; Bin Liu; Wei Zhao; Raoxin Yan; Yang Zou
---|---
Format: | Article
Language: | English
Published: | MDPI AG, 2023-03-01
Series: | Applied Sciences
Subjects: | speech emotion recognition; data augmentation; feature selection; multi-head attention; features fusion
Online Access: | https://www.mdpi.com/2076-3417/13/7/4124
_version_ | 1797608376993054720 |
author | Zhongwen Tu; Bin Liu; Wei Zhao; Raoxin Yan; Yang Zou
author_facet | Zhongwen Tu; Bin Liu; Wei Zhao; Raoxin Yan; Yang Zou
author_sort | Zhongwen Tu |
collection | DOAJ |
description | Speech Emotion Recognition (SER), which aims to analyze the emotion expressed in speech, has long been an important topic in speech acoustics. In recent years, the application of deep-learning methods has made great progress in SER. However, the small scale of emotional speech datasets and the lack of effective emotional feature representations still limit progress. In this paper, a novel SER method combining data augmentation, feature selection, and feature fusion is proposed. First, to address the problems that speech emotion datasets contain too few samples and that the number of samples per category is unbalanced, a speech data augmentation method, Mix-wav, is proposed, which is applied to audio of the same emotion category. Then, on the one hand, a Multi-Head Attention mechanism-based Convolutional Recurrent Neural Network (MHA-CRNN) model is proposed to extract a spectrum vector from the Log-Mel spectrum. On the other hand, a Light Gradient Boosting Machine (LightGBM) is used for feature-set selection and feature dimensionality reduction over four global emotion feature sets, extracting more effective emotion statistical features for fusion with the previously extracted spectrum vector. Experiments are carried out on the public Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset and the Chinese Hierarchical Speech Emotion Dataset of Broadcasting (CHSE-DB). The experiments show that the proposed method achieves unweighted average test accuracies of 66.44% and 93.47%, respectively. Our research shows that, through feature fusion, a global feature set refined by feature selection can complement the features extracted by a single deep-learning model to achieve better classification accuracy.
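The record gives only a prose summary of the method, so the two Python sketches below are rough, non-authoritative illustrations of what two of the described steps could look like. All function names, hyperparameters, and design details here (the Beta-distributed blend weight, truncation-based length alignment, gain-based importance ranking, and the number of retained dimensions) are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a same-emotion-class waveform augmentation in the
# spirit of Mix-wav. The Beta-distributed blend weight and truncation-based
# length alignment are assumptions borrowed from standard mixup.
import numpy as np

def mix_wav(wav_a: np.ndarray, wav_b: np.ndarray, alpha: float = 0.2) -> np.ndarray:
    """Blend two waveforms of the SAME emotion class into a new sample."""
    lam = np.random.beta(alpha, alpha)   # mixing weight in (0, 1)
    n = min(len(wav_a), len(wav_b))      # align lengths by truncation
    return lam * wav_a[:n] + (1.0 - lam) * wav_b[:n]
```

The LightGBM feature-selection step could likewise be sketched as training a gradient-boosted classifier on the global feature sets and keeping only the most important dimensions; the estimator settings and `keep` value are placeholders.

```python
# Sketch of LightGBM-based selection over a global emotion feature set.
# Hyperparameters and the number of retained dimensions are placeholders.
import lightgbm as lgb
import numpy as np

def select_features(X: np.ndarray, y: np.ndarray, keep: int = 128) -> np.ndarray:
    """Return indices of the `keep` most important feature dimensions."""
    clf = lgb.LGBMClassifier(n_estimators=200, importance_type="gain")
    clf.fit(X, y)
    order = np.argsort(clf.feature_importances_)[::-1]  # descending importance
    return order[:keep]
```

Presumably the retained dimensions are then concatenated with the MHA-CRNN spectrum vector to form the fused representation, e.g. `np.concatenate([spectrum_vec, X[i, idx]])` per utterance.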
first_indexed | 2024-03-11T05:43:34Z |
format | Article |
id | doaj.art-273b22aee6a14ba589e4a3e9824e2ebe |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-11T05:43:34Z |
publishDate | 2023-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-273b22aee6a14ba589e4a3e9824e2ebe; 2023-11-17T16:15:51Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2023-03-01; 13(7): 4124; doi:10.3390/app13074124; A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition; Zhongwen Tu (Educational Service Center, Communication University of China, Beijing 100024, China); Bin Liu (School of Information and Engineering, Communication University of China, Beijing 100024, China); Wei Zhao (School of Data and Intelligence, Communication University of China, Beijing 100024, China); Raoxin Yan (School of Information and Engineering, Communication University of China, Beijing 100024, China); Yang Zou (School of Information and Engineering, Communication University of China, Beijing 100024, China); https://www.mdpi.com/2076-3417/13/7/4124; speech emotion recognition; data augmentation; feature selection; multi-head attention; features fusion
spellingShingle | Zhongwen Tu; Bin Liu; Wei Zhao; Raoxin Yan; Yang Zou; A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition; Applied Sciences; speech emotion recognition; data augmentation; feature selection; multi-head attention; features fusion
title | A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition |
title_full | A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition |
title_fullStr | A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition |
title_full_unstemmed | A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition |
title_short | A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition |
title_sort | feature fusion model with data augmentation for speech emotion recognition |
topic | speech emotion recognition; data augmentation; feature selection; multi-head attention; features fusion
url | https://www.mdpi.com/2076-3417/13/7/4124 |
work_keys_str_mv | AT zhongwentu afeaturefusionmodelwithdataaugmentationforspeechemotionrecognition AT binliu afeaturefusionmodelwithdataaugmentationforspeechemotionrecognition AT weizhao afeaturefusionmodelwithdataaugmentationforspeechemotionrecognition AT raoxinyan afeaturefusionmodelwithdataaugmentationforspeechemotionrecognition AT yangzou afeaturefusionmodelwithdataaugmentationforspeechemotionrecognition AT zhongwentu featurefusionmodelwithdataaugmentationforspeechemotionrecognition AT binliu featurefusionmodelwithdataaugmentationforspeechemotionrecognition AT weizhao featurefusionmodelwithdataaugmentationforspeechemotionrecognition AT raoxinyan featurefusionmodelwithdataaugmentationforspeechemotionrecognition AT yangzou featurefusionmodelwithdataaugmentationforspeechemotionrecognition |