Self-supervised video understanding


Bibliographic Details
Main Author: Lu, E
Other Authors: Zisserman, A
Format: Thesis
Language: English
Published: 2021
Description
Summary: The advent of deep learning has brought about great progress on many fundamental computer vision tasks such as classification, detection, and segmentation, which describe the categories and locations of objects in images and video. There has also been much work done on supervised learning, teaching machines to solve these tasks using human-annotated labels. However, it is insufficient for machines to know only the names and locations of certain objects; many tasks require a deeper understanding of the complex physical world, such as how objects interact with their surroundings (often by creating shadows, reflections, surface deformations, and other visual effects). Furthermore, training models to solve these tasks while relying heavily on human supervision is costly and impractical to scale. Thus, this thesis explores two directions: first, we go beyond segmentation and address a wholly new task, grouping objects with their correlated visual effects (e.g. shadows, reflections, or attached objects); second, we address the fundamental task of video object segmentation in a self-supervised manner, without relying on any human annotation.

To automatically group objects with their correlated visual effects, we adopt a layered approach: we decompose a video into object-specific layers, each containing all elements that move with its object. One application of these layers is that they can be recombined in new ways to produce a highly realistic, altered version of the original video (e.g. removing or duplicating objects, or changing the timing of their motions). The key is to leverage natural properties of convolutional neural networks to obtain a layered decomposition of the input video: we design a neural network that outputs the layers for a video by overfitting to that video. We first introduce a human-specific method, then show how it can be adapted to arbitrary object classes, such as animals or cars.

Our second task is video object segmentation: producing pixel-wise labels (segments) for objects in videos. Whereas our previous work is optimized on a single video, here we take a data-driven approach and train on a large corpus of videos in a self-supervised manner. We consider two task settings: (1) semi-supervised video object segmentation, where an initial object mask is provided for a single frame and the method must propagate it to the remaining frames; and (2) moving object discovery, where no mask is given and the method must segment the salient moving object. We explore two input streams, RGB and optical flow, and discuss their connection to the human visual system.
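The recombination idea in the summary can be illustrated with a minimal sketch. This is not the thesis's actual network, only an assumed back-to-front alpha-compositing step: once a frame has been decomposed into RGBA layers, compositing a subset of the layers yields an edited frame (e.g. dropping an object's layer removes the object and everything that moves with it). All names and values here are hypothetical.

```python
import numpy as np

def composite(layers):
    """Composite RGBA layers back-to-front into a single RGB frame.

    layers: list of (rgb, alpha) pairs, ordered background first.
    rgb: (H, W, 3) float arrays in [0, 1]; alpha: (H, W, 1) in [0, 1].
    """
    canvas = np.zeros_like(layers[0][0])
    for rgb, alpha in layers:
        # Standard "over" operation: this layer covers what is beneath it
        # in proportion to its alpha.
        canvas = alpha * rgb + (1.0 - alpha) * canvas
    return canvas

# Hypothetical 2x2 frame: an opaque gray background plus a
# half-transparent object layer.
H, W = 2, 2
background = (np.full((H, W, 3), 0.2), np.ones((H, W, 1)))
obj = (np.full((H, W, 3), 0.8), np.full((H, W, 1), 0.5))

frame = composite([background, obj])          # original reconstruction
frame_edited = composite([background])        # "object removed" edit
```

Editing the video then amounts to choosing which layers to composite (or shifting a layer in time), rather than painting pixels directly; the hard part, which the thesis addresses, is producing layers that also capture correlated effects such as shadows and reflections.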