Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion

Decades of scientific research have been devoted to developing and evaluating methods for automated emotion recognition. As technology rapidly advances, a wide range of emerging applications require recognition of the user's emotional state. This paper investigates a robust approach to multimodal emotion recognition in conversation. Three separate models for the audio, video, and text modalities are built and fine-tuned on the MELD dataset. A transformer-based crossmodality fusion with the EmbraceNet architecture is then employed to estimate the emotion. The proposed multimodal network achieves up to 65% accuracy, significantly surpassing each of the unimodal models. We apply multiple evaluation techniques to show that our model is robust and can even outperform state-of-the-art models on MELD.
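Since the abstract names both a crossmodal transformer and the EmbraceNet architecture, a minimal PyTorch sketch of those two building blocks follows, for orientation only. It is not the authors' implementation; the module names (CrossmodalBlock, EmbraceFusion), all dimensions, and the uniform per-coordinate sampling probabilities are illustrative assumptions.

```python
# A minimal sketch of the two fusion ideas named in the abstract:
# crossmodal transformer attention and EmbraceNet-style stochastic fusion.
# NOT the authors' code; dimensions and probabilities are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossmodalBlock(nn.Module):
    """One modality (target) attends to another (source) via cross-attention."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (B, T_tgt, dim); source: (B, T_src, dim)
        attended, _ = self.attn(query=target, key=source, value=source)
        x = self.norm1(target + attended)   # residual connection + layer norm
        return self.norm2(x + self.ff(x))   # position-wise feed-forward

class EmbraceFusion(nn.Module):
    """EmbraceNet-style fusion: 'dock' each modality to a shared width, then
    assemble the fused vector by sampling, per output coordinate, which
    modality contributes (uniform modality probabilities assumed here)."""
    def __init__(self, in_dims, embrace_dim: int = 256):
        super().__init__()
        self.dock = nn.ModuleList([nn.Linear(d, embrace_dim) for d in in_dims])

    def forward(self, feats) -> torch.Tensor:
        # feats: list of per-modality feature vectors, each (B, in_dims[i])
        docked = torch.stack([dock(x) for dock, x in zip(self.dock, feats)],
                             dim=1)                                # (B, M, C)
        B, M, C = docked.shape
        probs = torch.full((B, M), 1.0 / M, device=docked.device)
        idx = torch.multinomial(probs, C, replacement=True)        # (B, C)
        mask = F.one_hot(idx, M).transpose(1, 2).to(docked.dtype)  # (B, M, C)
        return (docked * mask).sum(dim=1)                          # (B, C)

# Toy usage: fuse pooled audio/video/text features into 7 MELD emotion logits.
fusion = EmbraceFusion(in_dims=[128, 512, 768], embrace_dim=256)
classifier = nn.Linear(256, 7)
audio, video, text = torch.randn(2, 128), torch.randn(2, 512), torch.randn(2, 768)
logits = classifier(fusion([audio, video, text]))  # shape: (2, 7)
```

In EmbraceNet-style fusion, the per-coordinate modality sampling acts as a regularizer: because any coordinate may be drawn from any modality at training time, no single modality can dominate the fused representation, which is one plausible source of the robustness the abstract claims.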

Bibliographic Details
Main Authors: Baijun Xie, Mariia Sidulova, Chung Hyuk Park (Department of Biomedical Engineering, School of Engineering and Applied Science, George Washington University, Washington, DC 20052, USA)
Format: Article
Language: English
Published: MDPI AG, 2021-07-01
Series: Sensors, Vol. 21, No. 14, Article 4913
ISSN: 1424-8220
DOI: 10.3390/s21144913
Subjects: multimodal emotion recognition; multimodal fusion; crossmodal transformer; attention mechanism
Online Access: https://www.mdpi.com/1424-8220/21/14/4913