Discovering, Learning, and Exploiting Visual Cues

Bibliographic Details
Main Author: Tiwary, Kushagra
Other Authors: Raskar, Ramesh
Format: Thesis
Published: Massachusetts Institute of Technology 2023
Online Access: https://hdl.handle.net/1721.1/152014
https://orcid.org/0000-0003-3964-8771
Description
Summary: Animals have evolved over millions of years to exploit the faintest visual cues for perception, navigation, and survival. Complex and intricate vision systems found in animals, such as bee eyes, exploit cues like polarization of light relative to the Sun’s position to navigate and process motion at one three-hundredth of a second. In humans, the evolution of the eyes and the processing of visual cues are also tightly intertwined. Babies develop depth-of-field at 6 months, are often scared of their own shadows, and confuse their reflections with the real world. As the infant matures into an adult, they intuitively learn from their experiences how these cues instead provide valuable hidden information about their environments and can be exploited for depth perception and driving. Inspired by our usage of visual cues, this thesis explores visual cues in the modern context of data-driven imaging techniques. We first explore how visual cues can be learned from and exploited by combining physics-based forward models with data-driven AI systems. We map the space of physics-based and data-driven systems and show the future of vision lies in the intersection of both regimes. Next, we show how shadows can be exploited to image and 3D reconstruct the hidden parts of the scene. We then exploit multi-view reflections to convert household objects into radiance-field cameras that can image the world from the object's perspective in 5D. This enables applications of occlusion imaging, beyond field-of-view novel-view synthesis, and depth estimation from objects to their environments. Finally, we discuss how current approaches rely on humans to design imaging systems that can learn and exploit visual cues. However, as sensing in space, time, and different modalities becomes ubiquitous, relying on human-designed systems is not sufficient to build complex vision systems. We then propose a technique that combines reinforcement learning with computer vision to automatically learn which cues to exploit to accomplish the task without human intervention. We show how in one such scenario agents can start to automatically learn to use multiple cameras and the triangulation cue to estimate the depth of an unknown object in the scene without access to prior information about the camera, the algorithm, or the object.