A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition
Speech Emotion Recognition (SER), which aims to analyze the emotion expressed in speech, has long been an important topic in speech acoustics. In recent years, the application of deep-learning methods has made great progress in SER. However, the small scale of the emotional...
Main Authors: | Zhongwen Tu; Bin Liu; Wei Zhao; Raoxin Yan; Yang Zou
---|---
Format: | Article
Language: | English
Published: | MDPI AG, 2023-03-01
Series: | Applied Sciences
Subjects: | speech emotion recognition; data augmentation; feature selection; multi-head attention; features fusion
Online Access: | https://www.mdpi.com/2076-3417/13/7/4124
_version_ | 1797608376993054720 |
author | Zhongwen Tu; Bin Liu; Wei Zhao; Raoxin Yan; Yang Zou
author_facet | Zhongwen Tu; Bin Liu; Wei Zhao; Raoxin Yan; Yang Zou
author_sort | Zhongwen Tu |
collection | DOAJ |
description | Speech Emotion Recognition (SER), which aims to analyze the emotion expressed in speech, has long been an important topic in speech acoustics. In recent years, the application of deep-learning methods has made great progress in SER. However, the small scale of emotional speech datasets and the lack of effective emotional feature representations still limit progress. In this paper, a novel SER method combining data augmentation, feature selection, and feature fusion is proposed. First, to address the problems that speech emotion datasets contain too few samples and that the number of samples per category is unbalanced, a speech data augmentation method, Mix-wav, is proposed, which is applied to audio of the same emotion category. Then, on the one hand, a Multi-Head Attention mechanism-based Convolutional Recurrent Neural Network (MHA-CRNN) model is proposed to extract a spectrum vector from the Log-Mel spectrum. On the other hand, a Light Gradient Boosting Machine (LightGBM) is used for feature-set selection and feature dimensionality reduction over four global emotion feature sets, extracting more effective emotion statistical features for fusion with the previously extracted spectrum vector. Experiments are carried out on the public Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset and the Chinese Hierarchical Speech Emotion Dataset of Broadcasting (CHSE-DB). The experiments show that the proposed method achieves unweighted average test accuracies of 66.44% and 93.47%, respectively. Our research shows that, through feature fusion, a global feature set refined by feature selection can complement the features extracted by a single deep-learning model to achieve better classification accuracy.
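The record gives only a prose summary of the method, so the two Python sketches below are rough, non-authoritative illustrations of what two of the described steps could look like. All function names, hyperparameters, and design details here (the Beta-distributed blend weight, truncation-based length alignment, gain-based importance ranking, and the number of retained dimensions) are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a same-emotion-class waveform augmentation in the
# spirit of Mix-wav. The Beta-distributed blend weight and truncation-based
# length alignment are assumptions borrowed from standard mixup.
import numpy as np

def mix_wav(wav_a: np.ndarray, wav_b: np.ndarray, alpha: float = 0.2) -> np.ndarray:
    """Blend two waveforms of the SAME emotion class into a new sample."""
    lam = np.random.beta(alpha, alpha)   # mixing weight in (0, 1)
    n = min(len(wav_a), len(wav_b))      # align lengths by truncation
    return lam * wav_a[:n] + (1.0 - lam) * wav_b[:n]
```

The LightGBM feature-selection step could likewise be sketched as training a gradient-boosted classifier on the global feature sets and keeping only the most important dimensions; the estimator settings and `keep` value are placeholders.

```python
# Sketch of LightGBM-based selection over a global emotion feature set.
# Hyperparameters and the number of retained dimensions are placeholders.
import lightgbm as lgb
import numpy as np

def select_features(X: np.ndarray, y: np.ndarray, keep: int = 128) -> np.ndarray:
    """Return indices of the `keep` most important feature dimensions."""
    clf = lgb.LGBMClassifier(n_estimators=200, importance_type="gain")
    clf.fit(X, y)
    order = np.argsort(clf.feature_importances_)[::-1]  # descending importance
    return order[:keep]
```

Presumably the retained dimensions are then concatenated with the MHA-CRNN spectrum vector to form the fused representation, e.g. `np.concatenate([spectrum_vec, X[i, idx]])` per utterance.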
first_indexed | 2024-03-11T05:43:34Z |
format | Article |
id | doaj.art-273b22aee6a14ba589e4a3e9824e2ebe |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-11T05:43:34Z |
publishDate | 2023-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-273b22aee6a14ba589e4a3e9824e2ebe; 2023-11-17T16:15:51Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2023-03-01; 13(7): 4124; doi:10.3390/app13074124; A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition; Zhongwen Tu (Educational Service Center, Communication University of China, Beijing 100024, China); Bin Liu (School of Information and Engineering, Communication University of China, Beijing 100024, China); Wei Zhao (School of Data and Intelligence, Communication University of China, Beijing 100024, China); Raoxin Yan (School of Information and Engineering, Communication University of China, Beijing 100024, China); Yang Zou (School of Information and Engineering, Communication University of China, Beijing 100024, China); https://www.mdpi.com/2076-3417/13/7/4124; speech emotion recognition; data augmentation; feature selection; multi-head attention; features fusion
spellingShingle | Zhongwen Tu; Bin Liu; Wei Zhao; Raoxin Yan; Yang Zou; A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition; Applied Sciences; speech emotion recognition; data augmentation; feature selection; multi-head attention; features fusion
title | A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition |
title_full | A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition |
title_fullStr | A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition |
title_full_unstemmed | A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition |
title_short | A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition |
title_sort | feature fusion model with data augmentation for speech emotion recognition |
topic | speech emotion recognition; data augmentation; feature selection; multi-head attention; features fusion
url | https://www.mdpi.com/2076-3417/13/7/4124 |
work_keys_str_mv | AT zhongwentu afeaturefusionmodelwithdataaugmentationforspeechemotionrecognition AT binliu afeaturefusionmodelwithdataaugmentationforspeechemotionrecognition AT weizhao afeaturefusionmodelwithdataaugmentationforspeechemotionrecognition AT raoxinyan afeaturefusionmodelwithdataaugmentationforspeechemotionrecognition AT yangzou afeaturefusionmodelwithdataaugmentationforspeechemotionrecognition AT zhongwentu featurefusionmodelwithdataaugmentationforspeechemotionrecognition AT binliu featurefusionmodelwithdataaugmentationforspeechemotionrecognition AT weizhao featurefusionmodelwithdataaugmentationforspeechemotionrecognition AT raoxinyan featurefusionmodelwithdataaugmentationforspeechemotionrecognition AT yangzou featurefusionmodelwithdataaugmentationforspeechemotionrecognition |