Summary: | <p>Our experience of the world is multimodal; deep learning networks, however, have traditionally been designed for and trained on unimodal inputs such as images, audio segments or text. In this thesis we develop strategies to exploit multimodal information (in the form of vision, text, speech and non-speech audio) for the automatic understanding of human-centric videos. The key ideas developed in this thesis are (i) Cross-modal Supervision, (ii) Self-supervised Representation Learning, and (iii) Modality Fusion.</p>
<p>In cross-modal supervision, data labels from a supervision-rich modality are used to learn representations in another, supervision-starved target modality, eschewing the need for costly manual annotation in the target modality. This effectively exploits the redundant, or overlapping, information between modalities. We demonstrate the utility of this technique on three different tasks. First, we use face recognition and visual active speaker detection to curate VoxCeleb, a large-scale audio-visual dataset of human speech; training on it yields state-of-the-art models for speaker recognition. Second, we train a text-based model to predict action labels from transcribed speech alone and transfer these labels to the accompanying videos; training with these labels allows us to outperform fully supervised action recognition models trained with costly manual supervision. Third, we distill the information from a face model trained for emotion recognition to the speech domain, where manual emotion annotation is expensive.</p>
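<p>As an informal illustration of the cross-modal distillation idea behind the third task, the sketch below shows a frozen face-based emotion teacher supplying soft labels that supervise a speech student on the paired audio. The module names, dimensions and temperature are hypothetical placeholders for illustration, not the exact models or objective used in the thesis.</p>
<pre><code>
# Minimal sketch of cross-modal distillation (hypothetical models and names):
# a frozen teacher trained on faces provides soft emotion labels that
# supervise a student operating on the paired speech segment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechStudent(nn.Module):
    """Toy speech encoder: mel-spectrogram in, emotion logits out."""
    def __init__(self, n_emotions=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, n_emotions)

    def forward(self, spectrogram):
        return self.head(self.encoder(spectrogram))

def distillation_step(face_teacher, speech_student, faces, spectrograms,
                      optimiser, temperature=2.0):
    """One training step: match the student's distribution over emotions
    (predicted from audio) to the teacher's distribution (from the paired face)."""
    with torch.no_grad():
        teacher_probs = F.softmax(face_teacher(faces) / temperature, dim=-1)
    student_log_probs = F.log_softmax(speech_student(spectrograms) / temperature, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
</code></pre>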
<p>The second key idea explored in this thesis is the use of modality redundancy for self-supervised representation learning. Here we learn audio-visual representations, specifically for human faces and voices, without any manual supervision in either modality. Unlike existing representations, our joint representations enable cross-modal retrieval from audio to vision and vice versa. We then extend this work to explicitly remove learnt biases, enabling greater generalisation.</p>
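<p>The sketch below gives one way such a joint face-voice embedding space could be trained and queried. It assumes generic PyTorch encoders and a symmetric contrastive loss over paired face and voice embeddings; this is an illustrative simplification rather than the specific objective or architecture used in the thesis.</p>
<pre><code>
# Minimal sketch of a joint face-voice embedding trained contrastively
# (hypothetical setup). Matching face/voice pairs are pulled together in a
# shared space, so retrieval works in both directions via nearest neighbours.
import torch
import torch.nn.functional as F

def contrastive_loss(face_emb, voice_emb, temperature=0.07):
    """Symmetric cross-modal contrastive loss over a batch of paired
    face and voice embeddings (shape: batch x dim)."""
    face_emb = F.normalize(face_emb, dim=-1)
    voice_emb = F.normalize(voice_emb, dim=-1)
    logits = face_emb @ voice_emb.t() / temperature               # pairwise similarities
    targets = torch.arange(face_emb.size(0), device=face_emb.device)  # i-th face matches i-th voice
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def retrieve(query_emb, gallery_emb, k=5):
    """Cross-modal retrieval: rank gallery items (e.g. faces) by cosine
    similarity to a query from the other modality (e.g. a voice)."""
    query_emb = F.normalize(query_emb, dim=-1)
    gallery_emb = F.normalize(gallery_emb, dim=-1)
    sims = query_emb @ gallery_emb.t()
    return sims.topk(k, dim=-1).indices
</code></pre>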
<p>Finally, we combine the complementary information in different modalities through the development of new modality fusion architectures. By distilling the information from the multiple modalities in a video into a single, compact video representation, we achieve robustness to unimodal inputs that may be missing, corrupted or occluded, or that contain varying levels of background noise. With these models we achieve state-of-the-art results in both action recognition and video-text retrieval.</p>
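<p>As a rough illustration of modality fusion, the sketch below treats pre-extracted per-modality features as tokens, attends over them with a learned summary token, and masks out any modality that is missing, which gives some robustness to incomplete inputs. The layer choices, dimensions and masking scheme are assumptions for illustration only, not the specific fusion architectures developed in the thesis.</p>
<pre><code>
# Minimal sketch of fusing per-modality features into a single compact
# video embedding (hypothetical dimensions and layer choices). Missing or
# corrupted modalities are masked out of the attention.
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learned summary token

    def forward(self, modality_feats, modality_mask):
        """modality_feats: (batch, n_modalities, dim) pre-extracted features,
        e.g. [rgb, audio, text]; modality_mask: (batch, n_modalities) bool,
        True where the modality is missing."""
        batch = modality_feats.size(0)
        cls = self.cls.expand(batch, -1, -1)
        tokens = torch.cat([cls, modality_feats], dim=1)
        # never mask the summary token itself
        pad_mask = torch.cat([torch.zeros(batch, 1, dtype=torch.bool,
                                          device=modality_mask.device),
                              modality_mask], dim=1)
        fused = self.encoder(tokens, src_key_padding_mask=pad_mask)
        return fused[:, 0]  # compact video representation
</code></pre>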
|