Unsupervised Learning of 3D Objects in the Wild

Bibliographic Details
Main Author: Wu, S
Other Authors: Vedaldi, A
Format: Thesis
Language: English
Published: 2022
Description
Summary: We live in a physical 3D world. Being able to perceive the world in 3D allows us to confidently navigate and perform sophisticated actions in it. Developing 3D perception systems is not only key to many AR and robotics applications, but also fundamental to the goal of visual understanding. However, most existing learning-based image understanding models process images simply as compositions of 2D patterns, ignoring the fact that images arise from a physical 3D world. Recent advances in differentiable rendering have opened up new possibilities for learning physically-grounded, disentangled 3D representations from raw images, marrying geometry with learning at last.

Learning 3D representations typically requires 3D supervision. This has previously come in the form of 3D labels, shape models, multiple views, or 2D geometric annotations such as keypoints, masks, and camera viewpoints. However, all of these are prohibitively expensive to collect at scale, which has become a major obstacle to learning 3D representations of general objects in the wild.

In this thesis, we explore the possibility of learning deformable 3D objects simply from raw, casually recorded 2D image data “in the wild”, such as online photos and YouTube videos, without relying on heavy manual supervision. At the core of our approach is a Photo-Geometric Autoencoding framework that decomposes images into a set of physically-grounded photometric and geometric factors, including 3D shape, pose, shiny surface material, environment lighting, and camera parameters, with the objective of recomposing them to reconstruct the input images through a differentiable renderer. To resolve the ambiguity in this highly ill-posed task, we propose injecting generic inductive biases that exploit the symmetries and regularities of the world. The resulting model learns disentangled 3D priors for a variety of deformable object categories, allowing single-image inference at test time.
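The photo-geometric autoencoding pipeline described in the summary reduces to a reconstruction loop: predict physically-grounded factors (shape, albedo, lighting, viewpoint) from a single image, then recompose them into an image and compare against the input. The following is a minimal PyTorch sketch of that idea, not the thesis's actual implementation: all architectural choices (the layer sizes, the 4-D lighting code, the 6-D viewpoint code) are illustrative assumptions, and a simple Lambertian shading step stands in for the full differentiable renderer.

```python
# Minimal sketch of photo-geometric autoencoding (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_encoder(out_ch):
    # Tiny fully-convolutional encoder-decoder predicting a per-pixel map.
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1),
    )

def vector_head(out_dim):
    # Global head predicting a low-dimensional factor (lighting or pose).
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, out_dim),
    )

def depth_to_normals(depth):
    # Approximate surface normals from depth via finite differences.
    dzdx = F.pad(depth[..., :, 1:] - depth[..., :, :-1], (0, 1))
    dzdy = F.pad(depth[..., 1:, :] - depth[..., :-1, :], (0, 0, 0, 1))
    n = torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

class PhotoGeometricAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Separate networks predict each disentangled factor from the image.
        self.depth_net = conv_encoder(1)    # per-pixel depth (3D shape)
        self.albedo_net = conv_encoder(3)   # diffuse surface colour
        self.light_net = vector_head(4)     # ambient, diffuse, light direction
        self.view_net = vector_head(6)      # camera rotation + translation

    def forward(self, img):
        depth = self.depth_net(img)
        albedo = torch.sigmoid(self.albedo_net(img))
        light = self.light_net(img)
        view = self.view_net(img)  # would drive the reprojection step below
        normals = depth_to_normals(depth)
        # Lambertian shading: ambient term + diffuse term along light direction.
        amb = torch.sigmoid(light[:, :1]).view(-1, 1, 1, 1)
        dif = torch.sigmoid(light[:, 1:2]).view(-1, 1, 1, 1)
        l_dir = F.normalize(
            torch.cat([light[:, 2:4], torch.ones_like(light[:, :1])], dim=1),
            dim=1,
        ).view(-1, 3, 1, 1)
        shading = amb + dif * (normals * l_dir).sum(1, keepdim=True).clamp(min=0)
        canonical = albedo * shading
        # A full implementation would warp `canonical` and `depth` to the
        # predicted viewpoint with a differentiable renderer; this sketch
        # returns the shaded canonical image as the reconstruction.
        return canonical

model = PhotoGeometricAutoencoder()
img = torch.rand(2, 3, 64, 64)
loss = F.l1_loss(model(img), img)  # photometric reconstruction objective
```

Because every step is differentiable, the reconstruction loss alone supervises all factors jointly. The inductive biases mentioned in the summary then constrain the decomposition; for instance, bilateral symmetry can be encoded by horizontally flipping the canonical depth and albedo and requiring the flipped reconstruction to match the input as well, so that asymmetric solutions are penalised.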