M2ER: Multimodal Emotion Recognition Based on Multi-Party Dialogue Scenarios

Bibliographic Details
Main Authors: Bo Zhang, Xiya Yang, Ge Wang, Ying Wang, Rui Sun
Format: Article
Language: English
Published: MDPI AG 2023-10-01
Series: Applied Sciences
Subjects:
Online Access: https://www.mdpi.com/2076-3417/13/20/11340
Description
Summary: Researchers have recently focused on multimodal emotion recognition, but issues persist in recognizing emotions in multi-party dialogue scenarios. Most studies have used only the text and audio modalities, ignoring the video modality. To address this, we propose M2ER, a multimodal emotion recognition scheme based on multi-party dialogue scenarios. To handle multiple faces appearing in the same video frame, M2ER introduces a speaker recognition method based on multi-face localization, which eliminates interference from non-speakers. An attention mechanism is used to fuse the different modalities for classification. We conducted extensive unimodal and multimodal fusion experiments on the multi-party dialogue dataset MELD. The results show that M2ER achieves better emotion recognition than the baseline model in both the text and audio modalities. In the video modality, the proposed method with speaker recognition improves emotion recognition performance by 6.58% compared to the method without speaker recognition. In addition, the attention-based multimodal fusion outperforms the baseline fusion model.
ISSN: 2076-3417