Summary: | This research merges affective computing and video summarization, enhancing
the latter with cross-disciplinary affective information; the resulting approach is termed
affective video summarization. Affective video summarization uses emotional cues to identify
emotionally impactful moments in a video, producing summaries that enhance the user
experience.
Existing visual-based video summarization methods frequently neglect affective
information that could improve summaries through emotional considerations. Alternatively,
they disregard the visual element and instead rely on other modalities,
such as EEG signals, to generate visual attention or emotion tagging for summarization. A
plausible cause is that the emotion labels needed to guide video summarization are costly to
acquire, and extensive labeling is required to capture the nuance that personalization
and subtle emotions demand. This study therefore addresses the cost of human annotation
and the limited scalability of affective video summarization.
This thesis proposes using EEG as a secondary modality that supplies emotional cues for video
summarization. The challenge, however, is demonstrating that EEG features retain affective
information after they are converted into a latent representation. The thesis thus investigates
three areas: 1) Emotion recognition by spatiotemporal modeling to prove
that the EEG features contain affective information. This preliminary study introduces
Regionally-Operated Domain Adversarial Networks (RODAN), an attention-based model
for EEG-based emotion classification. 2) Affective semantics analysis by generative modeling,
which employs the Superposition Quantized Variational Autoencoder (SQVAE), based on
an orthonormal eigenvector codebook and a spatiotemporal transformer as encoder and
decoder, to generate EEG latent representations and features that validate the presence of
affective information. 3) Affective-semantics-guided video summarization with deep
reinforcement learning, which proposes EEG-Video Emotion-based Summarization (EVES), a
policy-based reinforcement learning model that integrates video and EEG signals for
emotion-based summarization.
In the first study, RODAN achieved emotion classification accuracies of 60.75% on the
SEED-IV dataset and 31.84% on the DEAP dataset, indicating the presence of affective information.
Subsequently, EEG signals reconstructed by SQVAE on MAHNOB-HCI aligned
closely with the original signals, and emotion recognition results obtained from the latent
representations validated the presence of affective information. Finally, through multimodal
pre-training, EVES produced summaries that were 11.4% more coherent and 7.4% more
emotion-evoking than those of alternative reinforcement learning models. Overall, this
thesis establishes that EEG signals can encode affective information and that multimodal
video summarization enhances summaries’ coherence and emotional impact.
|