Learning to represent scenes with geometry and semantics
Main Author: | Wang, B |
---|---|
Other Authors: | Trigoni, A; Markham, A |
Format: | Thesis |
Language: | English |
Published: | 2022 |
Subjects: | Robotics; Artificial intelligence; Computer vision |
Institution: | University of Oxford |
<p>Scene representation is the process of converting sensory observations of an environment into compact descriptions. Such intelligent behavior is a cornerstone of artificial intelligence, and scientists have long sought to reproduce the extraordinary human ability to understand the physical environment. Taking visual sensory observations of an environment as input, modern intelligent systems mostly aim to learn neural representations that encode fundamental scene properties such as geometry and semantics. Such representations can be leveraged to support downstream tasks and ultimately to realize autonomous perception of, and interaction with, the complex 3D world.</p>
<p>Recent deep neural networks have shown impressive performance at modeling geometric and semantic information in neural scene representations. However, constructing robust systems remains highly challenging because these models are fragile in uncontrolled real-world scenarios. Scene representation learning must contend with the variance of sensory observations under scene changes, the domain gap between different types of visual representations, and the need to perceive multiple categories of information efficiently. To overcome these challenges, we pursue robust, unified, and informative scene representations, learned with geometry and semantics from different types of visual input, paving the way towards intelligent machines that autonomously learn to understand the world around them. In this context, this thesis makes three core contributions, in visual localization, pixel-point matching, and semantic surface reconstruction.</p>
<p>We start by estimating the 6 Degree-of-Freedom (DoF) camera pose from single images. To learn scene representations that are robust to environmental changes and sensor operation, we propose a neural network with a self-attention module that models the complex geometric relationships between given images and a reference environment. We then build a more general framework that finds unified representations across 2D images and 3D point clouds, exploiting the inherent constraints of epipolar geometry and stereo vision. By introducing an ultra-wide reception mechanism together with novel loss functions, we present a dual fully-convolutional framework that maps 2D and 3D inputs into a shared latent representation space to simultaneously describe and detect keypoints, bridging the gap between 2D and 3D representations. Finally, we extend our study to informative representations, which intelligent systems generally require in order to serve multiple purposes at once in real-world scenarios. Drawing on previous work on point-based networks, we introduce a new end-to-end neural implicit function that jointly estimates precise 3D surfaces and semantics from raw, large-scale point clouds.</p>
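To make the first contribution concrete, here is a minimal, hypothetical PyTorch sketch of attention-guided absolute pose regression. The ResNet-18 backbone, layer sizes, and the log-quaternion rotation parameterization are illustrative assumptions, not the thesis's exact architecture.

```python
# Hypothetical sketch of attention-guided 6-DoF pose regression.
# Assumptions (not from the thesis): ResNet-18 backbone, multi-head
# self-attention over spatial features, log-quaternion rotation output.
import torch
import torch.nn as nn
import torchvision

class AttentionPoseNet(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, h, w)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc_t = nn.Linear(dim, 3)  # translation (x, y, z)
        self.fc_r = nn.Linear(dim, 3)  # rotation as a log-quaternion

    def forward(self, img):
        f = self.cnn(img)                      # spatial feature map
        tokens = f.flatten(2).transpose(1, 2)  # (B, h*w, 512) token sequence
        attended, _ = self.attn(tokens, tokens, tokens)  # self-attention reweighting
        g = attended.mean(dim=1)               # pooled global descriptor
        return torch.cat([self.fc_t(g), self.fc_r(g)], dim=1)  # (B, 6) pose

pose = AttentionPoseNet()(torch.randn(2, 3, 224, 224))  # -> torch.Size([2, 6])
```

Similarly, the joint surface-and-semantics idea in the final contribution can be sketched as an implicit function with two output heads. The distance-to-surface output, feature dimension, and class count below are assumptions for illustration; the local feature is assumed to come from some point-based encoder.

```python
class SemanticImplicitFunction(nn.Module):
    # Hypothetical two-head implicit decoder: a query 3D point plus a local
    # point-cloud feature maps to both a distance-to-surface value and
    # per-class semantic logits.
    def __init__(self, feat_dim=256, hidden=256, num_classes=13):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.dist_head = nn.Linear(hidden, 1)           # geometry: distance to surface
        self.sem_head = nn.Linear(hidden, num_classes)  # semantics: class logits

    def forward(self, xyz, feat):
        h = self.trunk(torch.cat([xyz, feat], dim=-1))
        return self.dist_head(h), self.sem_head(h)
```

At inference, querying such a function on a dense grid and extracting the zero-distance level set would yield a semantically labeled surface; the actual encoders and training losses are detailed in the thesis itself.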
<p>Overall, this thesis develops a series of novel deep neural frameworks that advance the learning of scene representations, moving towards artificial intelligence that can fully perceive our real-world 3D environment.</p>