LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition

Bibliographic Details
Main Authors: Feng Liu, Si-Yuan Shen, Zi-Wang Fu, Han-Yang Wang, Ai-Min Zhou, Jia-Yin Qi
Author Affiliations:
  Feng Liu: Institute of AI for Education, East China Normal University, Shanghai 200062, China
  Si-Yuan Shen: School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
  Zi-Wang Fu: School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  Han-Yang Wang: School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
  Ai-Min Zhou: Institute of AI for Education, East China Normal University, Shanghai 200062, China
  Jia-Yin Qi: Institute of Artificial Intelligence and Change Management, Shanghai University of International Business and Economics, Shanghai 200062, China
Format: Article
Language: English
Published: MDPI AG, 2022-07-01
Series: Entropy, Vol. 24, Iss. 7, Article 1010
ISSN: 1099-4300
DOI: 10.3390/e24071010
Subjects: entropy invariance; multimodal speech emotion recognition; cross-attention; gate control; lightweight model; computational affection
Online Access: https://www.mdpi.com/1099-4300/24/7/1010
Description: Speech emotion recognition (SER) aims to recognize human emotional states from utterances carrying both acoustic and linguistic information, and semantic-rich SER is widely used across a range of areas. Since textual and audio patterns both play essential roles in SER, various works have proposed novel modality-fusion methods to exploit text and audio signals effectively. However, the high performance of most existing models depends on a large number of learnable parameters, and such models only work well on data of a fixed length. Minimizing computational overhead and improving generalization to unseen data of various lengths, while maintaining a given level of recognition accuracy, is therefore a pressing practical problem. In this paper, we propose LGCCT, a light gated and crossed complementation transformer for multimodal speech emotion recognition. First, our model fuses modality information efficiently: acoustic features are extracted by a CNN-BiLSTM and textual features by a BiLSTM, a cross-attention module generates the modality-fused representation, and a gate-control mechanism balances the original modality representation against the modality-fused one. Second, the degree of attention focus can be taken into account: the uncertainty, i.e., the entropy, of attention over the same token should converge to the same value independently of the sequence length. To improve generalization to various testing-sequence lengths, we therefore adopt a length-scaled dot product to compute the attention scores, which admits this entropy-based interpretation and is cheap to compute yet effective. Experiments are conducted on the benchmark dataset CMU-MOSEI. Compared with the baseline models, our model achieves an 81.0% F1 score with only 0.432 M parameters, improving the balance between performance and parameter count. Moreover, an ablation study confirms the effectiveness of the model and its scalability to various input-sequence lengths, with a relative improvement of almost 20% over the baseline without the length-scaled dot product.
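
The description above names two lightweight modality encoders: a CNN-BiLSTM for the acoustic stream and a BiLSTM for the text stream. The following is a minimal PyTorch sketch of what such encoders could look like; the layer sizes, the 74-dimensional acoustic input (typical of CMU-MOSEI's COVAREP features), and the 300-dimensional word embeddings are illustrative assumptions, not the authors' released configuration.

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    """CNN-BiLSTM over acoustic frames: a 1-D convolution extracts local
    patterns, then a BiLSTM models temporal context (assumed layout)."""
    def __init__(self, in_dim: int = 74, d_model: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, d_model, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(d_model, d_model // 2,
                              bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, in_dim) -> (batch, frames, d_model)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        out, _ = self.bilstm(h)
        return out

class TextEncoder(nn.Module):
    """BiLSTM over pre-extracted word embeddings (e.g. 300-d GloVe)."""
    def __init__(self, in_dim: int = 300, d_model: int = 64):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, d_model // 2,
                              bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, in_dim) -> (batch, tokens, d_model)
        out, _ = self.bilstm(x)
        return out
```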
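
The fusion step combines cross-attention with gate control: queries come from one modality, keys and values from the other, and a sigmoid gate balances the original representation against the fused one. This is a sketch under those stated assumptions; the description does not give the gate's exact parameterization in LGCCT, so the concatenation-based gate below is one plausible choice.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Cross-attention fusion followed by a gate-control mechanism
    (hypothetical parameterization, not the authors' code)."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, target: torch.Tensor, source: torch.Tensor):
        # Queries from the target modality; keys/values from the other.
        fused, _ = self.cross_attn(query=target, key=source, value=source)
        # Sigmoid gate balances original vs. modality-fused features.
        g = torch.sigmoid(self.gate(torch.cat([target, fused], dim=-1)))
        return g * fused + (1.0 - g) * target

# Usage: text tokens attend to audio frames (both projected to d_model=64).
text = torch.randn(2, 20, 64)   # (batch, text length, d_model)
audio = torch.randn(2, 50, 64)  # (batch, audio length, d_model)
out = GatedCrossAttention()(text, audio)  # -> (2, 20, 64)
```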
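
The second contribution, the length-scaled dot product, replaces the standard 1/sqrt(d) attention scaling with a length-aware factor so that the attention entropy for a token stays roughly constant as the key-sequence length n changes. The abstract does not spell out the exact factor; the log n multiplier below is one common entropy-motivated formulation and should be read as an assumption.

```python
import math
import torch

def length_scaled_attention(q: torch.Tensor, k: torch.Tensor,
                            v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention whose logits grow with log n, keeping
    softmax entropy roughly length-invariant (assumed form of the scaling).
    q: (batch, Lq, d); k, v: (batch, Lk, d)."""
    d, n = q.size(-1), k.size(-2)
    scale = math.log(max(n, 2)) / math.sqrt(d)  # length-aware scale
    logits = torch.matmul(q, k.transpose(-2, -1)) * scale
    return torch.matmul(torch.softmax(logits, dim=-1), v)

# With the usual 1/sqrt(d) scale, attention over longer sequences tends
# toward higher entropy (flatter weights); multiplying by log n counteracts
# this, which is what lets the model generalize across sequence lengths.
```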