Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning
Recognition of various human emotions holds significant value in numerous real-world scenarios. This paper focuses on the multimodal fusion of speech and text for emotion recognition. A 39-dimensional Mel-frequency cepstral coefficient (MFCC) was used as a feature for speech emotion. A 300-dimension...
Main Authors: | Yanan Shang, Tianqi Fu |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2024-12-01 |
Series: | Intelligent Systems with Applications |
Subjects: | Multimodal fusion; Deep learning; Glove model; BiGRU; Emotion recognition |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2667305324001108 |
_version_ | 1826935423368167424 |
---|---|
author | Yanan Shang; Tianqi Fu |
author_facet | Yanan Shang; Tianqi Fu |
author_sort | Yanan Shang |
collection | DOAJ |
description | Recognition of various human emotions holds significant value in numerous real-world scenarios. This paper focuses on the multimodal fusion of speech and text for emotion recognition. A 39-dimensional Mel-frequency cepstral coefficient (MFCC) was used as a feature for speech emotion. A 300-dimensional word vector obtained through the Glove algorithm was used as the feature for text emotion. The bidirectional gate recurrent unit (BiGRU) method in deep learning was added for extracting deep features. Subsequently, it was combined with the multi-head self-attention (MHA) mechanism and the improved sparrow search algorithm (ISSA) to obtain the ISSA-BiGRU-MHA method for emotion recognition. It was validated on the IEMOCAP and MELD datasets. It was found that MFCC and Glove word vectors exhibited superior recognition effects as features. Comparisons with the support vector machine and convolutional neural network methods revealed that the ISSA-BiGRU-MHA method demonstrated the highest weighted accuracy and unweighted accuracy. Multimodal fusion achieved weighted accuracies of 76.52 %, 71.84 %, 66.72 %, and 62.12 % on the IEMOCAP, MELD, MOSI, and MOSEI datasets, suggesting better performance than unimodal fusion. These results affirm the reliability of the multimodal fusion recognition method, showing its practical applicability. |
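The pipeline described in the abstract (39-dimensional MFCC frames for speech, 300-dimensional GloVe word vectors for text, BiGRU feature extraction, multi-head self-attention, fusion, classification) can be sketched as below. The paper does not publish code, so the hidden size, number of attention heads, four-class output, concatenation-based fusion, and mean pooling are illustrative assumptions rather than the authors' implementation, and the ISSA hyperparameter search is omitted.

```python
# Minimal sketch of an ISSA-BiGRU-MHA-style fusion classifier, assuming PyTorch.
# Hidden size, head count, class count, and the fusion/pooling strategy are assumptions;
# the ISSA step (hyperparameter optimization) is not shown.
import torch
import torch.nn as nn

class BiGRUMHAFusion(nn.Module):
    """Speech (39-d MFCC frames) + text (300-d GloVe tokens) emotion classifier."""
    def __init__(self, n_classes=4, hidden=128, heads=4):
        super().__init__()
        # One BiGRU per modality extracts deep sequential features.
        self.speech_gru = nn.GRU(39, hidden, batch_first=True, bidirectional=True)
        self.text_gru = nn.GRU(300, hidden, batch_first=True, bidirectional=True)
        # Multi-head self-attention over the concatenated multimodal sequence.
        self.mha = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, mfcc, glove):
        # mfcc: (B, T_audio, 39), glove: (B, T_text, 300)
        a, _ = self.speech_gru(mfcc)              # (B, T_audio, 2*hidden)
        t, _ = self.text_gru(glove)               # (B, T_text, 2*hidden)
        fused = torch.cat([a, t], dim=1)          # concatenate the two sequences
        attn, _ = self.mha(fused, fused, fused)   # self-attention re-weights both modalities
        return self.classifier(attn.mean(dim=1))  # mean-pool over time, then classify

# Usage (random tensors standing in for MFCC and GloVe features):
# logits = BiGRUMHAFusion()(torch.randn(2, 200, 39), torch.randn(2, 30, 300))
```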
first_indexed | 2025-02-17T18:05:09Z |
format | Article |
id | doaj.art-7f847b55ec6a44fa801c13f5b51a8f74 |
institution | Directory Open Access Journal |
issn | 2667-3053 |
language | English |
last_indexed | 2025-02-17T18:05:09Z |
publishDate | 2024-12-01 |
publisher | Elsevier |
record_format | Article |
series | Intelligent Systems with Applications |
spelling | doaj.art-7f847b55ec6a44fa801c13f5b51a8f74; 2024-12-13T11:07:23Z; eng; Elsevier; Intelligent Systems with Applications; 2667-3053; 2024-12-01; volume 24, article 200436; "Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning"; Yanan Shang (corresponding author; Cangzhou Normal University, Cangzhou, Hebei 061001, China); Tianqi Fu (Cangzhou Normal University, Cangzhou, Hebei 061001, China); abstract as in the description field; http://www.sciencedirect.com/science/article/pii/S2667305324001108; keywords: Multimodal fusion, Deep learning, Glove model, BiGRU, Emotion recognition |
spellingShingle | Yanan Shang; Tianqi Fu; Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning; Intelligent Systems with Applications; Multimodal fusion; Deep learning; Glove model; BiGRU; Emotion recognition |
title | Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning |
title_full | Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning |
title_fullStr | Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning |
title_full_unstemmed | Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning |
title_short | Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning |
title_sort | multimodal fusion a study on speech text emotion recognition with the integration of deep learning |
topic | Multimodal fusion; Deep learning; Glove model; BiGRU; Emotion recognition |
url | http://www.sciencedirect.com/science/article/pii/S2667305324001108 |
work_keys_str_mv | AT yananshang multimodalfusionastudyonspeechtextemotionrecognitionwiththeintegrationofdeeplearning AT tianqifu multimodalfusionastudyonspeechtextemotionrecognitionwiththeintegrationofdeeplearning |