Pursuing Mid-level Perception from Casual Videos

This thesis summarizes a series of explorations around a central theme: how can we learn mid-level perception from collections of casually shot videos? To avoid disappointing the reader, I would like to be frank at the start: the contents are only first steps towards solving the problem....

Bibliographic Details
Main Author: Zhang, Zhoutong
Other Authors: Freeman, William T.
Format: Thesis
Published: Massachusetts Institute of Technology 2023
Online Access: https://hdl.handle.net/1721.1/147336
_version_ 1826209002800807936
author Zhang, Zhoutong
author2 Freeman, William T.
author_facet Freeman, William T.
Zhang, Zhoutong
author_sort Zhang, Zhoutong
collection MIT
description This thesis summarizes a series of explorations around a central theme: how can we learn mid-level perception from collections of casually shot videos? To avoid disappointing the reader, I would like to be frank at the start: the contents are only first steps towards solving the problem. Specifically, a major part of this thesis addresses the problem of recovering depth and ego-motion despite the dynamics in the video, which is only part of the mid-level perception problem. Why those in particular? First of all, they are the pillars of 3D understanding if the agent is able to move and interact with the dynamic world. In a narrower sense, this corresponds to the "mid-level vision" in Marr’s perception theory, where 2.5D sketches are recovered from processed image signals. If we add the flexibility of motion, then the task also includes recovering the ego-motion, i.e. the trajectory of the viewer through time. In addition, depth and ego-motion recovery have the potential to help solve other mid-level vision tasks. In this thesis, we show that we can solve the video version of the checkershadow illusion [1] when both the observer and the checker are moving simultaneously. This is done by building a 3D representation of the scene that is split into persistent and transient effects, which is only possible with the recovered depth and ego-motion. Recovering depth and the camera’s ego-motion from videos with unrestricted object motion and ego-motion is quite challenging. The first chapter of the thesis introduces the problem, briefly reviews past work, and demonstrates how and why it fails to solve the problem robustly. The second chapter addresses a partial form of the problem: given a video with known camera ego-motion, how to recover reliable depth maps even if there is significant object motion in the scene.
The third chapter of the thesis addresses the full problem, presenting a solution to jointly recover depth and camera ego-motion for casually shot videos. It remains to ask: why the ambitious title? Why not a more specific one and end the thesis here? Perhaps a bit unconventionally, I would like to think of this thesis as a starting milestone for the topic, which I feel committed and excited to pursue, rather than an end, a mere wrap-up of what I did in my graduate studies. Therefore, the last chapter, named "Video Canonicalization", is dedicated to an ongoing pursuit that aims to provide a structure that is helpful for analyzing different works and clarifying design dimensions for solving mid-level vision problems using videos. Some parts of this chapter may seem half-baked, with rudimentary experiments and examples that merely aim to prove the concept. Hopefully those will mature into future projects that better bear the title. Finally, I would like to cite, though not in its exact form, Patrick Winston’s remarks when I entered MIT: "There’s only one thing I can promise you after your journey at MIT: you will find the thing you are truly excited about, which will drive you for the future. If not, I’ll come to you and you will be in trouble with me." I’m really glad that this turned out to be true, but sad that he would never come to us even if it had not.
first_indexed 2024-09-23T14:15:49Z
format Thesis
id mit-1721.1/147336
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T14:15:49Z
publishDate 2023
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/147336 2023-01-20T03:20:39Z Pursuing Mid-level Perception from Casual Videos Zhang, Zhoutong Freeman, William T. Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Ph.D. 2023-01-19T18:46:20Z 2023-01-19T18:46:20Z 2022-09 2022-10-19T19:12:13.568Z Thesis https://hdl.handle.net/1721.1/147336 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle Zhang, Zhoutong
Pursuing Mid-level Perception from Casual Videos
title Pursuing Mid-level Perception from Casual Videos
title_sort pursuing mid level perception from casual videos
url https://hdl.handle.net/1721.1/147336
work_keys_str_mv AT zhangzhoutong pursuingmidlevelperceptionfromcasualvideos