Multimodal continuous emotion analysis

Bibliographic Details
Main Author: Zhang, Su
Other Authors: Guan Cuntai
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2023
School: School of Computer Science and Engineering
Subjects: Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence
DOI: 10.32657/10356/166783
Online Access: https://hdl.handle.net/10356/166783
Citation: Zhang, S. (2023). Multimodal continuous emotion analysis. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/166783
License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

Description

Emotion recognition is an increasingly popular research topic in various fields, including human-computer interaction and affective computing. Continuous emotion recognition (CER), a sub-task in this area, performs sequence-to-sequence regression on the provided emotion cues, as opposed to, for example, sequence-to-category emotion classification. To build a trustworthy deep learning model for CER, it is essential to learn long-range temporal dynamics and to preserve cross-subject generality. Emotion is a continuous process that depends on past emotional states, so the dynamics over a longer time frame must be considered for accurate prediction; it is also susceptible to individual differences because it is linked to personal characteristics such as experience, mood, and personality. To tackle these challenges, we developed four approaches that exploit the advantages of long-range temporal learning and multi-modality in different ways.

The first method, which serves as the foundation for the other three, focuses on long-range temporal modeling for CER using unimodal emotion information. Experiments on the MAHNOB-HCI database show that it outperforms the state-of-the-art method. We also explore the contributions of different brain regions and EEG frequency bands to the emotion process using a saliency-map-based visualization method.

The second method proposes using the continuous labels' temporal and visual information to enhance EEG-based emotion classification. The standard configuration assigns a single categorical label to each trial, ignoring temporal variation within the trial, which may reduce the classifier's effectiveness. To overcome this limitation, a thresholding scheme converts the emotional trace into discretized labels, allowing the training process to occur in an N-to-N manner. By discretizing the trace into three classes, the classifier can fit the features to their corresponding three-class labels more flexibly. Experimental results show a statistically significant 3% increase in EEG-based emotion classification accuracy.

The third method trains a teacher model on the visual modality and a student model on the EEG modality, with the teacher's temporal embeddings serving as dark knowledge for the student. Using an L1 loss and a concordance correlation coefficient (CCC) loss, the student model learns to fit the teacher's knowledge and to predict the continuous labels. Experimental results show that this cross-modal knowledge distillation (CKD) method outperforms the student model trained without distillation in terms of root mean square error (RMSE), Pearson correlation coefficient (PCC), and CCC. This approach provides a promising way to leverage the complementarity of different modalities for CER.

The final method proposed in this thesis involves multimodal feature fusion for CER. Using multiple modalities helps disambiguate conflicting cues and preserves recognition robustness; for example, a crying face accompanied by joyful vocal expressions can be recognized as happiness rather than sadness. The leader-follower attentive network (LFAN) combines the learned encodings of the visual and EEG modalities through a cross-modality co-attention mechanism, emphasizing the dominant visual modality, which is believed to have the strongest correlation with the label. Experiments on the AVEC2019, MAHNOB-HCI, and AffWild2 databases demonstrate that the proposed LFAN achieves promising results compared with state-of-the-art methods.
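
As a rough illustration of the label-discretization idea in the second method, the sketch below maps a continuous annotation trace to three per-time-step classes. The function name and the threshold values are hypothetical, not taken from the thesis; the actual thresholding scheme may differ.

    import numpy as np

    def discretize_trace(trace, low=-0.1, high=0.1):
        # trace: 1-D NumPy array holding a continuous valence/arousal annotation.
        # Returns per-time-step labels: 0 = negative, 1 = neutral, 2 = positive.
        labels = np.ones_like(trace, dtype=np.int64)  # default: neutral
        labels[trace < low] = 0                       # below lower threshold: negative
        labels[trace > high] = 2                      # above upper threshold: positive
        return labels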
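
The concordance correlation coefficient appears both as a training loss and as an evaluation metric in the third and fourth methods. Below is a minimal PyTorch-style sketch of a CCC loss (1 minus Lin's concordance correlation coefficient); the tensor shapes and the epsilon term are assumptions for illustration, not the thesis's exact implementation.

    import torch

    def ccc_loss(pred, gold, eps=1e-8):
        # pred, gold: 1-D tensors holding a predicted and a ground-truth emotion trace.
        pred_mean, gold_mean = pred.mean(), gold.mean()
        pred_var = pred.var(unbiased=False)
        gold_var = gold.var(unbiased=False)
        cov = ((pred - pred_mean) * (gold - gold_mean)).mean()
        ccc = 2.0 * cov / (pred_var + gold_var + (pred_mean - gold_mean) ** 2 + eps)
        return 1.0 - ccc  # minimizing 1 - CCC drives predictions toward the annotation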
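
For the cross-modality distillation described in the third method, one plausible training objective combines an L1 term that pulls the student's temporal embeddings toward the teacher's with a CCC term on the continuous labels, reusing the ccc_loss sketch above. The function names and the weighting factor are assumptions, not the thesis's exact formulation.

    import torch.nn.functional as F

    def distillation_objective(student_emb, teacher_emb, student_pred, gold, alpha=1.0):
        # L1 imitation of the (frozen) teacher's temporal embeddings ...
        imitation = F.l1_loss(student_emb, teacher_emb.detach())
        # ... plus a CCC regression loss on the continuous emotion labels.
        regression = ccc_loss(student_pred, gold)
        return imitation + alpha * regression  # alpha is a hypothetical trade-off weight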