AttendAffectNet–Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-Attention

In this paper, we tackle the problem of predicting the affective responses of movie viewers based on the content of the movies. Current studies on this topic focus on video representation learning and on fusion techniques that combine the extracted features for predicting affect. Yet, these approaches typically ignore both the correlation between the multiple modality inputs and the correlation between temporal inputs (i.e., sequential features). To explore these correlations, we propose a neural network architecture, AttendAffectNet (AAN), that uses the self-attention mechanism to predict the emotions of movie viewers from different input modalities. In particular, visual, audio, and text features are considered for predicting emotions, which are expressed in terms of valence and arousal. We analyze three variants of the proposed AAN: the Feature AAN, the Temporal AAN, and the Mixed AAN. The Feature AAN applies self-attention to the features extracted from the different modalities (video, audio, and movie subtitles) of a whole movie, thereby capturing the relationships between them. The Temporal AAN takes the time domain of the movies and the sequential dependency of affective responses into account: self-attention is applied to the concatenated (multimodal) feature vectors representing subsequent movie segments. The Mixed AAN combines the strengths of the Feature AAN and the Temporal AAN by applying self-attention first to the feature vectors obtained from the different modalities within each movie segment, and then to the resulting representations of all subsequent (temporal) movie segments. We extensively trained and validated the proposed AAN on both the MediaEval 2016 dataset for the Emotional Impact of Movies Task and the extended COGNIMUSE dataset. Our experiments show that audio features play a more influential role than features extracted from video and movie subtitles when predicting the emotions of movie viewers on these datasets. Models that use all visual, audio, and text features simultaneously as inputs performed better than those using features extracted from each modality separately. In addition, the Feature AAN outperformed the other AAN variants on the above-mentioned datasets, highlighting the importance of treating the different features as context to one another when fusing them. The Feature AAN also performed better than the baseline models when predicting the valence dimension.
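
As a rough illustration of the Feature AAN idea described in the abstract, the sketch below shows how per-modality feature vectors can be projected into a shared space, fused with a Transformer self-attention encoder so that each modality serves as context for the others, and then regressed to valence and arousal. This is a minimal PyTorch sketch, not the authors' implementation: the feature dimensions, model size, mean pooling, and regression head are all assumptions made for illustration.

```python
# Minimal sketch of a Feature-AAN-style fusion model (illustrative only;
# dimensions and hyperparameters are assumptions, not the published values).
import torch
import torch.nn as nn


class FeatureSelfAttentionFusion(nn.Module):
    """Fuse per-modality feature vectors with self-attention, then regress valence/arousal."""

    def __init__(self, modality_dims, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        # One linear projection per modality, mapping each feature vector to a shared size.
        self.projections = nn.ModuleList(nn.Linear(d, d_model) for d in modality_dims)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        # Self-attention lets each modality token attend to the others.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 2)  # outputs: (valence, arousal)

    def forward(self, features):
        # features: list of tensors, one per modality, each of shape (batch, dim_m)
        tokens = torch.stack(
            [proj(f) for proj, f in zip(self.projections, features)], dim=1
        )  # (batch, num_modalities, d_model)
        fused = self.encoder(tokens)   # self-attention across modality tokens
        pooled = fused.mean(dim=1)     # average over modalities
        return self.head(pooled)       # (batch, 2)


if __name__ == "__main__":
    # Hypothetical feature sizes for visual, audio, and subtitle (text) embeddings.
    model = FeatureSelfAttentionFusion(modality_dims=[2048, 1582, 768])
    visual, audio, text = torch.randn(4, 2048), torch.randn(4, 1582), torch.randn(4, 768)
    print(model([visual, audio, text]).shape)  # torch.Size([4, 2])
```

A Temporal-AAN-style variant, as described in the abstract, would instead treat the concatenated multimodal feature vector of each successive movie segment as one token in the self-attention sequence.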

Bibliographic Details
Main Authors: Ha Thi Phuong Thao, B T Balamurali, Gemma Roig, Dorien Herremans
Format: Article
Language: English
Published: MDPI AG, 2021-12-01
Series: Sensors, Vol. 21, Iss. 24, Article No. 8356
ISSN: 1424-8220
DOI: 10.3390/s21248356
Subjects: neural networks; self-attention; emotion prediction; MediaEval 2016; COGNIMUSE; affective computing
Online Access: https://www.mdpi.com/1424-8220/21/24/8356
Collection: DOAJ (Directory of Open Access Journals)
Author Affiliations:
Ha Thi Phuong Thao: Information Systems Technology and Design, Singapore University of Technology and Design, 8 Somapah Rd, Singapore 48737, Singapore
B T Balamurali: Science, Mathematics and Technology, Singapore University of Technology and Design, 8 Somapah Rd, Singapore 48737, Singapore
Gemma Roig: Computer Science Department, Goethe University Frankfurt, 60323 Frankfurt, Germany
Dorien Herremans: Information Systems Technology and Design, Singapore University of Technology and Design, 8 Somapah Rd, Singapore 48737, Singapore