Towards Object-based SLAM

Simultaneous localization and mapping (SLAM) is a fundamental capability for a robot to perceive its surrounding environment. The research area has developed for more than two decades from the original sparse landmark-based SLAM to dense SLAM, and now there is a demand for semantic understanding of...

Full description

Bibliographic Details
Main Author: Zhang, Yihao
Other Authors: Leonard, John J.
Format: Thesis
Published: Massachusetts Institute of Technology 2025
Online Access:https://hdl.handle.net/1721.1/158310
Description
Summary:Simultaneous localization and mapping (SLAM) is a fundamental capability for a robot to perceive its surrounding environment. The research area has developed for more than two decades from the original sparse landmark-based SLAM to dense SLAM, and now there is a demand for semantic understanding of the environment beyond pure geometric understanding. This thesis delves into object-based SLAM where the map consists of a set of objects with their semantic categories recognized and their poses and shapes estimated. Such a map provides vital object-level semantic and geometric perception to applications such as augmented reality (AR), mixed reality (MR), robot manipulation, and self-driving. In order to perform object-based SLAM, the sensor measurements have to undergo a series of processes. First, objects are semantically segmented in the sensor measurements. This step is typically done by a neural network. As robots are often required to bootstrap from some initial labeled datasets and adapt to different environments where labeled data are unavailable, it is important to enable semi-supervised learning to improve the robot’s performance with the unlabeled data collected by the robot itself. Second, after the objects are segmented, measurements for each object across different views have to be associated together for downstream processing. Lastly, the robot must be able to extract the object pose and shape information from the measurements without access to the detailed object CAD models which are commonly unavailable. This thesis studies these three aspects of object-based SLAM, namely semi-supervised learning of semantic segmentation in a robotics context, data association for object-based SLAM, and category-level object pose and shape estimation. The thesis closes with a discussion of how these components can be integrated into a full object-based SLAM system in the future.