Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion
Decades of scientific research have been conducted on developing and evaluating methods for automated emotion recognition. With rapidly advancing technology, a wide range of emerging applications require recognition of the user's emotional state. This paper investigates a robust approach for multimodal emotion recognition during a conversation. Three separate models for the audio, video, and text modalities are structured and fine-tuned on the MELD dataset, and a transformer-based crossmodality fusion with the EmbraceNet architecture is employed to estimate the emotion. The proposed multimodal network architecture achieves up to 65% accuracy, significantly surpassing any of the unimodal models. We apply multiple evaluation techniques to show that our model is robust and can even outperform state-of-the-art models on MELD.
Main Authors: Baijun Xie, Mariia Sidulova, Chung Hyuk Park
Affiliation: Department of Biomedical Engineering, School of Engineering and Applied Science, George Washington University, Washington, DC 20052, USA
Format: Article
Language: English
Published: MDPI AG, 2021-07-01
Series: Sensors
ISSN: 1424-8220
DOI: 10.3390/s21144913
Subjects: multimodal emotion recognition; multimodal fusion; crossmodal transformer; attention mechanism
Online Access: https://www.mdpi.com/1424-8220/21/14/4913
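
The abstract names two building blocks: crossmodal transformer attention between modality streams and EmbraceNet fusion. The following is a minimal PyTorch sketch of how such a pipeline can be wired together; it is not the authors' published implementation, and all dimensions, pooling choices, and the pairing of which modality attends to which are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossmodalAttention(nn.Module):
    """One crossmodal transformer block: the target modality queries the
    source modality, with a residual connection and layer norm."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # query = target sequence, key/value = source sequence
        out, _ = self.attn(target, source, source)
        return self.norm(target + out)


class EmbraceFusion(nn.Module):
    """EmbraceNet-style fusion: dock each modality vector to a shared size,
    then build each output feature from exactly one randomly chosen modality."""

    def __init__(self, in_dims, embrace_dim: int):
        super().__init__()
        self.dock = nn.ModuleList(nn.Linear(d, embrace_dim) for d in in_dims)

    def forward(self, feats):
        docked = torch.stack([dock(f) for dock, f in zip(self.dock, feats)], dim=1)  # (B, M, E)
        b, m, e = docked.shape
        # For every output feature index, sample which modality contributes it.
        probs = torch.full((b * e, m), 1.0 / m, device=docked.device)
        choice = torch.multinomial(probs, 1).view(b, e)                # (B, E)
        mask = F.one_hot(choice, m).permute(0, 2, 1).to(docked.dtype)  # (B, M, E)
        return (docked * mask).sum(dim=1)                              # (B, E)


# Dummy utterance-level features standing in for the three unimodal encoders.
dim = 256
audio = torch.randn(8, 20, dim)   # (batch, frames, dim)
video = torch.randn(8, 20, dim)
text = torch.randn(8, 20, dim)

cross = CrossmodalAttention(dim)
fusion = EmbraceFusion([dim, dim, dim], embrace_dim=dim)
classifier = nn.Linear(dim, 7)    # MELD has 7 emotion classes

audio_ctx = cross(audio, text).mean(dim=1)   # audio attends to text, then mean-pool
fused = fusion([audio_ctx, video.mean(dim=1), text.mean(dim=1)])
logits = classifier(fused)                   # (8, 7) emotion scores
```

The EmbraceNet idea is that every fused feature comes from exactly one modality, chosen stochastically, which encourages redundancy across modalities and is what the original EmbraceNet paper credits for robustness to a missing or noisy stream.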