Video understanding using multimodal deep learning

Our experience of the world is multimodal; however, deep learning networks have traditionally been designed for and trained on unimodal inputs such as images, audio segments, or text. In this thesis we develop strategies to exploit multimodal information (in the form of vision, text, speech and non-speech audio) for the automatic understanding of human-centric videos. The key ideas developed in this thesis are (i) Cross-modal Supervision, (ii) Self-supervised Representation Learning, and (iii) Modality Fusion.

In cross-modal supervision, data labels from a supervision-rich modality are used to learn representations in another, supervision-starved target modality, eschewing the need for costly manual annotation in the target domain. This effectively exploits the redundant, or overlapping, information between modalities. We demonstrate the utility of this technique on three tasks. First, we use face recognition and visual active speaker detection to curate a large-scale audio-visual dataset of human speech called VoxCeleb; training on this dataset yields state-of-the-art models for speaker recognition. Second, we train a text-based model to predict action labels from transcribed speech alone and transfer these labels to the accompanying videos; training with these labels allows us to outperform fully supervised action recognition models trained with costly manual annotation. Third, we distill the information from a face model trained for emotion recognition to the speech domain, where manual emotion annotation is expensive.
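
The face-to-speech distillation is the most concrete of these transfers, so here is a minimal, hypothetical sketch of how cross-modal supervision of that kind can look in PyTorch: a frozen face-emotion "teacher" provides soft targets for a speech "student" on unlabelled face/voice pairs. The network shapes, the eight-way label space, and the temperature-scaled KL loss are illustrative assumptions, not the thesis implementation.

```python
# Hypothetical sketch of cross-modal supervision by distillation: a frozen,
# pretrained face-emotion "teacher" supplies soft labels that supervise a
# speech "student" on unlabelled face/voice pairs from the same video.
# Module names, shapes and the 8-way label space are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 8  # assumed label space


class SpeechStudent(nn.Module):
    """Tiny stand-in for a speech emotion network over log-mel spectrograms."""

    def __init__(self, num_classes: int = NUM_EMOTIONS):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, n_mels, time) -> (batch, num_classes)
        return self.classifier(self.encoder(spectrogram))


def distillation_loss(teacher: nn.Module, student: nn.Module,
                      faces: torch.Tensor, spectrograms: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Temperature-scaled KL divergence between the teacher's face-based
    predictions and the student's speech-based predictions."""
    with torch.no_grad():
        soft_targets = F.softmax(teacher(faces) / temperature, dim=-1)
    log_probs = F.log_softmax(student(spectrograms) / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
```

The same teacher-student pattern underlies the other cross-modal transfers listed above, with the supervision-rich modality playing the teacher role.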

The second key idea explored in this thesis is the use of modality redundancy for self-supervised representation learning. Here we learn audio-visual representations without any manual supervision in either modality, specifically for human faces and voices. Unlike existing representations, our joint representations enable cross-modal retrieval from audio to vision and vice versa. We then extend this work to explicitly remove learnt biases, enabling greater generalisation.

Finally, we combine the complementary information in different modalities through the development of new modality fusion architectures. By distilling the information from multiple modalities in a video into a single, compact video representation, we achieve robustness to unimodal inputs that may be missing, corrupted, or occluded, or that contain varying levels of background noise. With these models we achieve state-of-the-art results in both action recognition and video-text retrieval.
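
As a rough illustration of the fusion idea in the final paragraph, the sketch below pools projected per-modality features into one compact video vector with learned attention, masking out modalities that are missing for a given clip. The modality set, feature sizes, and attention pooling are assumptions made for illustration; the abstract does not specify the actual fusion architectures.

```python
# Hypothetical sketch of modality fusion: per-modality features (e.g. RGB,
# audio, text) are projected to a common width and pooled with learned
# attention into a single compact video vector; a boolean mask lets the
# model cope with missing or corrupted modalities. Feature sizes and the
# pooling scheme are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, input_dims=(2048, 512, 768), d_model: int = 512):
        super().__init__()
        # one projection per input modality (input_dims are assumed sizes)
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in input_dims])
        self.score = nn.Linear(d_model, 1)  # scalar attention score per modality

    def forward(self, feats: list[torch.Tensor], present: torch.Tensor) -> torch.Tensor:
        """feats: list of (batch, dim_i) tensors, one per modality.
        present: (batch, num_modalities) bool mask; assumes at least one
        modality is available per clip."""
        tokens = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)  # (B, M, d)
        scores = self.score(torch.tanh(tokens)).squeeze(-1)                    # (B, M)
        scores = scores.masked_fill(~present, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)                  # (B, M, 1)
        return (weights * tokens).sum(dim=1)  # (B, d): compact video representation
```

A downstream head for action recognition or video-text retrieval would then operate on this single fused vector.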

Bibliographic Details
Main Author: Nagrani, A
Other Authors: Zisserman, A
Format: Thesis
Language: English
Published: 2020
Institution: University of Oxford
Subjects: Computer Vision; Machine Learning