Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning

Recognition of human emotions holds significant value in many real-world scenarios. This paper focuses on the multimodal fusion of speech and text for emotion recognition. A 39-dimensional Mel-frequency cepstral coefficient (MFCC) vector was used as the speech emotion feature, and a 300-dimensional word vector obtained with the GloVe algorithm was used as the text emotion feature. A bidirectional gated recurrent unit (BiGRU) was added to extract deep features; it was then combined with a multi-head self-attention (MHA) mechanism and the improved sparrow search algorithm (ISSA) to obtain the ISSA-BiGRU-MHA method for emotion recognition. The method was validated on the IEMOCAP, MELD, MOSI, and MOSEI datasets. MFCC and GloVe word vectors proved to be the superior features, and comparisons with support vector machine and convolutional neural network baselines showed that ISSA-BiGRU-MHA achieved the highest weighted and unweighted accuracy. Multimodal fusion reached weighted accuracies of 76.52%, 71.84%, 66.72%, and 62.12% on IEMOCAP, MELD, MOSI, and MOSEI, respectively, outperforming the unimodal variants. These results confirm the reliability and practical applicability of the multimodal fusion recognition method.
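As a rough illustration of the feature pipeline described in the abstract, the sketch below shows how 39-dimensional MFCC features (13 static coefficients plus their delta and delta-delta) and 300-dimensional GloVe word vectors could be prepared. The paper does not name its tooling, so librosa, the glove.6B.300d.txt file, and all parameter choices here are assumptions.

```python
# Illustrative feature extraction for the speech and text branches.
# Assumptions: librosa for audio, a pre-trained glove.6B.300d.txt file for text;
# the paper does not specify its exact toolchain or parameters.
import numpy as np
import librosa

def speech_features(wav_path, sr=16000, n_mfcc=13):
    """Return a (frames, 39) matrix: 13 MFCCs + delta + delta-delta."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (13, frames)
    delta = librosa.feature.delta(mfcc)                       # first-order differences
    delta2 = librosa.feature.delta(mfcc, order=2)              # second-order differences
    return np.concatenate([mfcc, delta, delta2], axis=0).T    # (frames, 39)

def load_glove(path="glove.6B.300d.txt"):
    """Map each word to its 300-dimensional GloVe vector."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def text_features(tokens, glove, dim=300):
    """Return a (tokens, 300) matrix, with zeros for out-of-vocabulary words."""
    return np.stack([glove.get(t.lower(), np.zeros(dim, dtype=np.float32)) for t in tokens])
```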

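The following is a minimal PyTorch sketch of the kind of model the abstract describes: a BiGRU encoder with multi-head self-attention for each modality, followed by a simple fusion classifier. Hidden sizes, mean pooling, concatenation-based fusion, and the class count are assumptions rather than the authors' exact architecture, and the ISSA hyperparameter search is not shown.

```python
# Minimal sketch of a BiGRU + multi-head self-attention (MHA) fusion classifier.
# Hidden sizes, pooling, and the concatenation-based fusion are assumptions; the
# improved sparrow search algorithm (ISSA) hyperparameter tuning is omitted.
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim=128, num_heads=4):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (batch, time, input_dim)
        h, _ = self.gru(x)                     # (batch, time, 2*hidden_dim)
        a, _ = self.attn(h, h, h)              # self-attention over time steps
        return a.mean(dim=1)                   # pooled utterance vector

class FusionClassifier(nn.Module):
    def __init__(self, speech_dim=39, text_dim=300, hidden_dim=128, num_classes=4):
        super().__init__()
        self.speech_enc = BiGRUEncoder(speech_dim, hidden_dim)
        self.text_enc = BiGRUEncoder(text_dim, hidden_dim)
        self.head = nn.Linear(4 * hidden_dim, num_classes)   # concatenated modalities

    def forward(self, speech, text):
        fused = torch.cat([self.speech_enc(speech), self.text_enc(text)], dim=-1)
        return self.head(fused)                # unnormalized class scores

# Example: batch of 8 utterances, 200 speech frames, 30 tokens, 4 emotion classes.
model = FusionClassifier()
logits = model(torch.randn(8, 200, 39), torch.randn(8, 30, 300))
```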

Bibliographic details
Main authors: Yanan Shang, Tianqi Fu
Affiliation: Cangzhou Normal University, Cangzhou, Hebei 061001, China
Format: Article
Language: English
Published: Elsevier, 2024-12-01
Series: Intelligent Systems with Applications
ISSN: 2667-3053
Collection: DOAJ (Directory of Open Access Journals)
Subjects: Multimodal fusion; Deep learning; GloVe model; BiGRU; Emotion recognition
Online access: http://www.sciencedirect.com/science/article/pii/S2667305324001108