Video understanding using multimodal deep learning

Our experience of the world is multimodal; however, deep learning networks have traditionally been designed for and trained on unimodal inputs such as images, audio segments, or text. In this thesis we develop strategies to exploit multimodal information (in the form of vision, text, speech and non-speech audio) for the automatic understanding of human-centric videos. The key ideas developed in this thesis are (i) Cross-modal Supervision, (ii) Self-supervised Representation Learning, and (iii) Modality Fusion.

In cross-modal supervision, data labels from a supervision-rich modality are used to learn representations in another, supervision-starved target modality, eschewing the need for costly manual annotation in the target domain. This effectively exploits the redundant, or overlapping, information between modalities. We demonstrate the utility of this technique on three tasks. First, we use face recognition and visual active speaker detection to curate a large-scale audio-visual dataset of human speech called VoxCeleb; training on this dataset yields state-of-the-art models for speaker recognition. Second, we train a text-based model to predict action labels from transcribed speech alone and transfer these labels to the accompanying videos; training with these labels allows us to outperform fully supervised action recognition models trained with costly manual annotation. Third, we distill the information from a face model trained for emotion recognition to the speech domain, where manual emotion annotation is expensive.
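
The face-to-speech distillation is the most concrete of these transfers, so here is a minimal, hypothetical sketch of how cross-modal supervision of that kind can look in PyTorch: a frozen face-emotion "teacher" provides soft targets for a speech "student" on unlabelled face/voice pairs. The network shapes, the eight-way label space, and the temperature-scaled KL loss are illustrative assumptions, not the thesis implementation.

```python
# Hypothetical sketch of cross-modal supervision by distillation: a frozen,
# pretrained face-emotion "teacher" supplies soft labels that supervise a
# speech "student" on unlabelled face/voice pairs from the same video.
# Module names, shapes and the 8-way label space are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 8  # assumed label space


class SpeechStudent(nn.Module):
    """Tiny stand-in for a speech emotion network over log-mel spectrograms."""

    def __init__(self, num_classes: int = NUM_EMOTIONS):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, n_mels, time) -> (batch, num_classes)
        return self.classifier(self.encoder(spectrogram))


def distillation_loss(teacher: nn.Module, student: nn.Module,
                      faces: torch.Tensor, spectrograms: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Temperature-scaled KL divergence between the teacher's face-based
    predictions and the student's speech-based predictions."""
    with torch.no_grad():
        soft_targets = F.softmax(teacher(faces) / temperature, dim=-1)
    log_probs = F.log_softmax(student(spectrograms) / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
```

The same teacher-student pattern underlies the other cross-modal transfers listed above, with the supervision-rich modality playing the teacher role.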

The second key idea explored in this thesis is the use of modality redundancy for self-supervised representation learning. Here we learn audio-visual representations without any manual supervision in either modality, specifically for human faces and voices. Unlike existing representations, our joint representations enable cross-modal retrieval from audio to vision and vice versa. We then extend this work to explicitly remove learnt biases, enabling greater generalisation.

Finally, we combine the complementary information in different modalities through the development of new modality fusion architectures. By distilling the information from multiple modalities in a video into a single, compact video representation, we achieve robustness to unimodal inputs that may be missing, corrupted, or occluded, or that contain varying levels of background noise. With these models we achieve state-of-the-art results in both action recognition and video-text retrieval.
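
As a rough illustration of the fusion idea in the final paragraph, the sketch below pools projected per-modality features into one compact video vector with learned attention, masking out modalities that are missing for a given clip. The modality set, feature sizes, and attention pooling are assumptions made for illustration; the abstract does not specify the actual fusion architectures.

```python
# Hypothetical sketch of modality fusion: per-modality features (e.g. RGB,
# audio, text) are projected to a common width and pooled with learned
# attention into a single compact video vector; a boolean mask lets the
# model cope with missing or corrupted modalities. Feature sizes and the
# pooling scheme are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, input_dims=(2048, 512, 768), d_model: int = 512):
        super().__init__()
        # one projection per input modality (input_dims are assumed sizes)
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in input_dims])
        self.score = nn.Linear(d_model, 1)  # scalar attention score per modality

    def forward(self, feats: list[torch.Tensor], present: torch.Tensor) -> torch.Tensor:
        """feats: list of (batch, dim_i) tensors, one per modality.
        present: (batch, num_modalities) bool mask; assumes at least one
        modality is available per clip."""
        tokens = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)  # (B, M, d)
        scores = self.score(torch.tanh(tokens)).squeeze(-1)                    # (B, M)
        scores = scores.masked_fill(~present, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)                  # (B, M, 1)
        return (weights * tokens).sum(dim=1)  # (B, d): compact video representation
```

A downstream head for action recognition or video-text retrieval would then operate on this single fused vector.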

Bibliographic Details
Main Author: Nagrani, A
Other Authors: Zisserman, A
Format: Thesis
Language: English
Published: 2020
Institution: University of Oxford
Subjects: Computer Vision; Machine Learning