Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning
Recognition of various human emotions holds significant value in numerous real-world scenarios. This paper focuses on the multimodal fusion of speech and text for emotion recognition. A 39-dimensional Mel-frequency cepstral coefficient (MFCC) was used as a feature for speech emotion. A 300-dimension...
Main Authors: | Yanan Shang, Tianqi Fu |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2024-12-01 |
Series: | Intelligent Systems with Applications |
Subjects: | Multimodal fusion; Deep learning; Glove model; BiGRU; Emotion recognition |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2667305324001108 |
_version_ | 1826935423368167424 |
---|---|
author | Yanan Shang; Tianqi Fu |
author_facet | Yanan Shang; Tianqi Fu |
author_sort | Yanan Shang |
collection | DOAJ |
description | Recognition of various human emotions holds significant value in numerous real-world scenarios. This paper focuses on the multimodal fusion of speech and text for emotion recognition. A 39-dimensional Mel-frequency cepstral coefficient (MFCC) was used as a feature for speech emotion. A 300-dimensional word vector obtained through the Glove algorithm was used as the feature for text emotion. The bidirectional gate recurrent unit (BiGRU) method in deep learning was added for extracting deep features. Subsequently, it was combined with the multi-head self-attention (MHA) mechanism and the improved sparrow search algorithm (ISSA) to obtain the ISSA-BiGRU-MHA method for emotion recognition. It was validated on the IEMOCAP and MELD datasets. It was found that MFCC and Glove word vectors exhibited superior recognition effects as features. Comparisons with the support vector machine and convolutional neural network methods revealed that the ISSA-BiGRU-MHA method demonstrated the highest weighted accuracy and unweighted accuracy. Multimodal fusion achieved weighted accuracies of 76.52 %, 71.84 %, 66.72 %, and 62.12 % on the IEMOCAP, MELD, MOSI, and MOSEI datasets, suggesting better performance than unimodal fusion. These results affirm the reliability of the multimodal fusion recognition method, showing its practical applicability. |
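The pipeline described in the abstract (39-dimensional MFCC frames for speech, 300-dimensional GloVe word vectors for text, BiGRU feature extraction, multi-head self-attention, fusion, classification) can be sketched as below. The paper does not publish code, so the hidden size, number of attention heads, four-class output, concatenation-based fusion, and mean pooling are illustrative assumptions rather than the authors' implementation, and the ISSA hyperparameter search is omitted.

```python
# Minimal sketch of an ISSA-BiGRU-MHA-style fusion classifier, assuming PyTorch.
# Hidden size, head count, class count, and the fusion/pooling strategy are assumptions;
# the ISSA step (hyperparameter optimization) is not shown.
import torch
import torch.nn as nn

class BiGRUMHAFusion(nn.Module):
    """Speech (39-d MFCC frames) + text (300-d GloVe tokens) emotion classifier."""
    def __init__(self, n_classes=4, hidden=128, heads=4):
        super().__init__()
        # One BiGRU per modality extracts deep sequential features.
        self.speech_gru = nn.GRU(39, hidden, batch_first=True, bidirectional=True)
        self.text_gru = nn.GRU(300, hidden, batch_first=True, bidirectional=True)
        # Multi-head self-attention over the concatenated multimodal sequence.
        self.mha = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, mfcc, glove):
        # mfcc: (B, T_audio, 39), glove: (B, T_text, 300)
        a, _ = self.speech_gru(mfcc)              # (B, T_audio, 2*hidden)
        t, _ = self.text_gru(glove)               # (B, T_text, 2*hidden)
        fused = torch.cat([a, t], dim=1)          # concatenate the two sequences
        attn, _ = self.mha(fused, fused, fused)   # self-attention re-weights both modalities
        return self.classifier(attn.mean(dim=1))  # mean-pool over time, then classify

# Usage (random tensors standing in for MFCC and GloVe features):
# logits = BiGRUMHAFusion()(torch.randn(2, 200, 39), torch.randn(2, 30, 300))
```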
first_indexed | 2025-02-17T18:05:09Z |
format | Article |
id | doaj.art-7f847b55ec6a44fa801c13f5b51a8f74 |
institution | Directory Open Access Journal |
issn | 2667-3053 |
language | English |
last_indexed | 2025-02-17T18:05:09Z |
publishDate | 2024-12-01 |
publisher | Elsevier |
record_format | Article |
series | Intelligent Systems with Applications |
spelling | doaj.art-7f847b55ec6a44fa801c13f5b51a8f74; 2024-12-13T11:07:23Z; eng; Elsevier; Intelligent Systems with Applications; 2667-3053; 2024-12-01; volume 24, article 200436; "Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning"; Yanan Shang (corresponding author; Cangzhou Normal University, Cangzhou, Hebei 061001, China); Tianqi Fu (Cangzhou Normal University, Cangzhou, Hebei 061001, China); abstract as in the description field; http://www.sciencedirect.com/science/article/pii/S2667305324001108; keywords: Multimodal fusion, Deep learning, Glove model, BiGRU, Emotion recognition |
spellingShingle | Yanan Shang; Tianqi Fu; Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning; Intelligent Systems with Applications; Multimodal fusion; Deep learning; Glove model; BiGRU; Emotion recognition |
title | Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning |
title_full | Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning |
title_fullStr | Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning |
title_full_unstemmed | Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning |
title_short | Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning |
title_sort | multimodal fusion a study on speech text emotion recognition with the integration of deep learning |
topic | Multimodal fusion; Deep learning; Glove model; BiGRU; Emotion recognition |
url | http://www.sciencedirect.com/science/article/pii/S2667305324001108 |
work_keys_str_mv | AT yananshang multimodalfusionastudyonspeechtextemotionrecognitionwiththeintegrationofdeeplearning AT tianqifu multimodalfusionastudyonspeechtextemotionrecognitionwiththeintegrationofdeeplearning |