Learning to represent scenes with geometry and semantics
Main Author: | Wang, B |
---|---|
Other Authors: | Trigoni, A; Markham, A |
Format: | Thesis |
Language: | English |
Published: | 2022 |
Subjects: | Robotics; Artificial intelligence; Computer vision |
Institution: | University of Oxford |
<p>Scene representation is the process of converting sensory observations of an environment into compact descriptions. Such intelligent behavior is a cornerstone of artificial intelligence, and scientists have long sought to reproduce the extraordinary human ability to understand the physical environment. Taking visual sensory observations of an environment as input, modern intelligent systems mostly aim to learn neural representations that encode fundamental scene properties such as geometry and semantics. Such representations can be leveraged to support downstream tasks and ultimately to realize autonomous perception of, and interaction with, the complex 3D world.</p>
<p>Recent deep neural networks have shown impressive performance at modeling geometric and semantic information in neural scene representations. However, constructing robust systems remains highly challenging because these models are fragile in uncontrolled real-world scenarios. Scene representation learning must contend with the variance of sensory observations under scene changes, the domain gap between different types of visual representations, and the need to perceive multiple categories of information efficiently. To overcome these challenges, we pursue robust, unified, and informative scene representations, learned with geometry and semantics from different types of visual input, paving the way towards intelligent machines that autonomously learn to understand the world around them. In this context, this thesis makes three core contributions, in visual localization, pixel-point matching, and semantic surface reconstruction.</p>
<p>We start by estimating the 6 Degree-of-Freedom (DoF) camera pose from single images. To learn scene representations that are robust to environmental changes and sensor operation, we propose a neural network with a self-attention module that models the complex geometric relationships between given images and a reference environment. We then build a more general framework that finds unified representations across 2D images and 3D point clouds, exploiting the inherent constraints of epipolar geometry and stereo vision. By introducing an ultra-wide reception mechanism together with novel loss functions, we present a dual fully-convolutional framework that maps 2D and 3D inputs into a shared latent representation space to simultaneously describe and detect keypoints, bridging the gap between 2D and 3D representations. Finally, we extend our study to informative representations, which intelligent systems generally require in order to serve multiple purposes at once in real-world scenarios. Drawing on previous work on point-based networks, we introduce a new end-to-end neural implicit function that jointly estimates precise 3D surfaces and semantics from raw, large-scale point clouds.</p>
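To make the first contribution concrete, here is a minimal, hypothetical PyTorch sketch of attention-guided absolute pose regression. The ResNet-18 backbone, layer sizes, and the log-quaternion rotation parameterization are illustrative assumptions, not the thesis's exact architecture.

```python
# Hypothetical sketch of attention-guided 6-DoF pose regression.
# Assumptions (not from the thesis): ResNet-18 backbone, multi-head
# self-attention over spatial features, log-quaternion rotation output.
import torch
import torch.nn as nn
import torchvision

class AttentionPoseNet(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, h, w)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc_t = nn.Linear(dim, 3)  # translation (x, y, z)
        self.fc_r = nn.Linear(dim, 3)  # rotation as a log-quaternion

    def forward(self, img):
        f = self.cnn(img)                      # spatial feature map
        tokens = f.flatten(2).transpose(1, 2)  # (B, h*w, 512) token sequence
        attended, _ = self.attn(tokens, tokens, tokens)  # self-attention reweighting
        g = attended.mean(dim=1)               # pooled global descriptor
        return torch.cat([self.fc_t(g), self.fc_r(g)], dim=1)  # (B, 6) pose

pose = AttentionPoseNet()(torch.randn(2, 3, 224, 224))  # -> torch.Size([2, 6])
```

Similarly, the joint surface-and-semantics idea in the final contribution can be sketched as an implicit function with two output heads. The distance-to-surface output, feature dimension, and class count below are assumptions for illustration; the local feature is assumed to come from some point-based encoder.

```python
class SemanticImplicitFunction(nn.Module):
    # Hypothetical two-head implicit decoder: a query 3D point plus a local
    # point-cloud feature maps to both a distance-to-surface value and
    # per-class semantic logits.
    def __init__(self, feat_dim=256, hidden=256, num_classes=13):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.dist_head = nn.Linear(hidden, 1)           # geometry: distance to surface
        self.sem_head = nn.Linear(hidden, num_classes)  # semantics: class logits

    def forward(self, xyz, feat):
        h = self.trunk(torch.cat([xyz, feat], dim=-1))
        return self.dist_head(h), self.sem_head(h)
```

At inference, querying such a function on a dense grid and extracting the zero-distance level set would yield a semantically labeled surface; the actual encoders and training losses are detailed in the thesis itself.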
<p>Overall, this thesis develops a series of novel deep neural frameworks that advance the learning of scene representations, moving towards artificial intelligence that can fully perceive our real-world 3D environment.</p>