Learning the structure of object categories from incomplete supervision


Bibliographic Details
Main Author: Novotny, D
Other Authors: Larlus, D
Format: Thesis
Language: English
Published: 2018
Subjects:
Description
Summary:
This thesis aims at learning and predicting the fine-grained structure of visual object categories given input image data. To alleviate the common requirement of collecting large amounts of manual annotations, we propose several approaches that learn from an incomplete supervisory signal.

Specifically, we begin with an analysis of the amount of supervision needed to learn all visual variations of an object part. Motivated by the gathered observations, a detector of semantic (i.e. nameable) parts supervised with inexpensive web image search data is then proposed. The main challenge of handling a significant amount of annotation noise is addressed with a novel geometry-appearance embedding.

Moving away from semantic part detection, the focus shifts to learning generic mid-level elements for understanding the geometry of object categories. A novel architecture that outputs a visual representation suitable for establishing image-to-image semantic correspondences is proposed. The main contribution consists of a new discriminability diversity objective that facilitates learning of sparse image features sensitive to changes in the geometry of the input.

A similar feature learning machine leveraging the equivariance constraint is later introduced. Unlike existing alternatives, the method is adapted to noisy training data by means of a novel probabilistic introspection framework. This allows for a selective representation of image pixels that have the potential to result in a correct match.

Inspired by the ability of deep networks to decompose an object into a constellation of pixel-perfect landmarks, the opposite problem of grouping image pixels belonging to an object is addressed. More specifically, we deal with the instance segmentation problem using a deep convolutional architecture that "colors" image pixels with their instance labels. Identifying the convolutional coloring dilemma, a drawback of standard position-agnostic networks that prevents them from solving this task, we propose a correction comprising a novel position-sensitive semi-convolutional operator.

The last task tackled is learning 3D shapes of object categories. Inspired by the human visual system, a deep network that learns by observing an object category in a sequence of videos is described.

Our final contribution is a probabilistic learning scheme that increases the robustness of network training and enables test-time confidence predictions. This is achieved by explicitly modeling the distribution of training errors caused by the insufficiencies of the model or by noise in the ground truth annotations.