Multimodal continuous emotion analysis

Bibliographic Details
Main Author: Zhang, Su
Other Authors: Guan Cuntai
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2023
Subjects:
Online Access: https://hdl.handle.net/10356/166783
Description
Summary: Emotion recognition is an increasingly popular research topic in various fields, including human-computer interaction and affective computing. Continuous emotion recognition (CER), a sub-task in this area, focuses on performing sequence-to-sequence regression on the provided emotion cues, as opposed to research topics such as sequence-to-category emotion classification. To build a trustworthy deep learning model for CER, it is essential to learn long-range temporal dynamics and preserve cross-subject generality. Emotion is a continuous event that depends on past emotional states, so the dynamics over a longer time frame must be considered for an accurate prediction. Moreover, emotion is susceptible to individual differences because it is linked to personal characteristics such as experience, mood, and personality. To tackle these challenges, we developed four approaches that exploit the advantages of long-range temporal learning and multi-modality in different ways.

Our first method, which serves as the foundation for the other three, focuses on long-range temporal modeling for CER using unimodal emotion information. Experiments on the MAHNOB-HCI database show the superior performance of our method compared with the state-of-the-art method. We also explore the contribution of different brain regions and EEG frequency bands to the emotion process using a saliency-map-based visualization method.

The second method uses the continuous labels' temporal and visual information to enhance EEG-based emotion classification. The standard configuration assigns a single categorical label to each trial, ignoring the temporal variation, which may reduce the classifier's effectiveness. To overcome this limitation, a thresholding scheme converts the emotional trace into a discretized label sequence, allowing the training process to proceed in an N-to-N manner. By discretizing the trace into three classes, the classifier can fit the features to their corresponding three-class labels more flexibly. Experimental results show a statistically significant 3% increase in EEG-based emotion classification accuracy.

The third method trains a teacher model on the visual modality and a student model on the EEG modality, with the teacher's temporal embeddings serving as dark knowledge for the student. By employing an L1 loss and a concordance correlation coefficient (CCC) loss, the student model learns to fit the teacher's knowledge and predict the continuous labels. Experimental results show that the cross-modal knowledge distillation (CKD) method outperforms the student model trained without distillation in terms of root mean square error (RMSE), Pearson correlation coefficient (PCC), and CCC. This approach provides a promising way to leverage the complementarity of different modalities for CER.

The final method proposed in this thesis involves multimodal feature fusion for CER. Utilizing multiple modalities can disambiguate emotion cues and preserve recognition robustness, improving accuracy in ambiguous cases, e.g., a crying face accompanied by joyful vocal expressions being recognized as happiness rather than sadness. The leader-follower attentive network (LFAN) combines the learned encodings of the visual and EEG modalities using a cross-modality co-attention mechanism, emphasizing the dominant visual modality, which is believed to have the strongest correlation with the label. Experiments on the AVEC2019, MAHNOB-HCI, and AffWild2 databases demonstrate that the proposed LFAN achieves promising results compared with state-of-the-art methods.
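As a rough illustration of the sequence-to-sequence regression setup underlying the first method, the sketch below shows a per-timestep regressor in PyTorch; the LSTM backbone, feature dimension, and hidden size are illustrative assumptions, not the architecture used in the thesis.

```python
# Minimal sketch of sequence-to-sequence regression for CER.
# Backbone choice (bidirectional LSTM), feat_dim, and hidden_dim are
# illustrative assumptions only.
import torch
import torch.nn as nn

class TemporalRegressor(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=64):
        super().__init__()
        # Recurrent encoder to capture long-range temporal dynamics.
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Per-timestep head regressing a continuous label (e.g., valence).
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):                 # x: (batch, time, feat_dim)
        h, _ = self.rnn(x)                # h: (batch, time, 2 * hidden_dim)
        return self.head(h).squeeze(-1)   # (batch, time)

model = TemporalRegressor()
features = torch.randn(4, 300, 128)       # 4 trials, 300 timesteps of features
predictions = model(features)             # continuous trace per trial
```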
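The thresholding scheme of the second method can be illustrated with a minimal sketch; the threshold values and class indices below are hypothetical, since the summary only states that the continuous trace is discretized into three classes so that training can proceed in an N-to-N manner.

```python
import numpy as np

def discretize_trace(trace, low=-0.1, high=0.1):
    """Map a continuous emotion trace to a per-timestep three-class label.

    The thresholds (low, high) and the class assignment are illustrative
    assumptions, not values taken from the thesis.
    """
    labels = np.ones_like(trace, dtype=np.int64)   # 1 = neutral band
    labels[trace < low] = 0                        # 0 = low / negative
    labels[trace > high] = 2                       # 2 = high / positive
    return labels

trace = np.array([-0.4, -0.05, 0.0, 0.2, 0.35])
print(discretize_trace(trace))   # [0 1 1 2 2]
```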
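For the third method, the CCC loss has a standard closed form, and the distillation objective described in the summary (L1 on the teacher's temporal embeddings plus CCC on the continuous labels) can be sketched as follows; the equal weighting via alpha is an assumption.

```python
import torch
import torch.nn.functional as F

def ccc_loss(pred, gold, eps=1e-8):
    """1 - concordance correlation coefficient, computed over a sequence."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    pred_var, gold_var = pred.var(unbiased=False), gold.var(unbiased=False)
    cov = ((pred - pred_mean) * (gold - gold_mean)).mean()
    ccc = 2 * cov / (pred_var + gold_var + (pred_mean - gold_mean) ** 2 + eps)
    return 1.0 - ccc

def distillation_loss(student_emb, teacher_emb, student_pred, label, alpha=0.5):
    """L1 fit to the teacher's embeddings (dark knowledge) plus CCC loss on
    the continuous labels; the weighting scheme is an illustrative assumption."""
    return alpha * F.l1_loss(student_emb, teacher_emb) + (1 - alpha) * ccc_loss(student_pred, label)
```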
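Finally, a cross-modality co-attention fusion in the spirit of the LFAN can be sketched as below, with the visual stream acting as the leader (query) over the EEG follower; the single attention head, dimensions, and concatenation-based emphasis on the leader are simplifying assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn

class LeaderFollowerFusion(nn.Module):
    """Simplified leader-follower co-attention: the visual (leader) encoding
    queries the EEG (follower) encoding, and the fused representation keeps
    the leader emphasized. Illustrative sketch only."""
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.head = nn.Linear(2 * dim, 1)    # regress a continuous label

    def forward(self, visual, eeg):          # both: (batch, time, dim)
        attended, _ = self.attn(query=visual, key=eeg, value=eeg)
        fused = torch.cat([visual, attended], dim=-1)   # leader + attended follower
        return self.head(fused).squeeze(-1)  # (batch, time)

fusion = LeaderFollowerFusion()
out = fusion(torch.randn(2, 100, 64), torch.randn(2, 100, 64))
```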