Summary: | <p>Precise ego-motion estimation is a fundamental component for numerous applications in robotics, autonomous driving, virtual/augmented reality, and mobile computing. Although most navigation systems rely heavily on signals from space-based Global Navigation Satellite Systems (GNSS), such as GPS, to estimate position, radio signals can be lost or degraded in many environments due to obstruction or reflection. In particular, operation in urban areas surrounded by high-rise buildings, underground, and in indoor environments remains highly challenging. Moreover, GNSS typically provides only metre-level location accuracy without orientation information, rendering it poorly suited to high-precision applications. Ego-motion estimation with internal sensors such as cameras, inertial measurement units (IMUs), lidars and radars provides an alternative for obtaining accurate relative self-position in these challenging environments. A robust and reliable ego-motion estimation system should address the sensor vulnerabilities caused by several factors, including occlusions, illumination changes, sensor noise, and multipath effects. Over the past decades, researchers have developed approaches based on various methodologies, including traditional and learning-based methods, to address this challenging problem, also known as odometry. Although learning-based approaches alleviate issues such as the need for hand-crafted mathematical features and the strict parameter tuning typically required by traditional methods, the acquisition of accurate ground-truth data to supervise learning-based systems is expensive and limited due to the need for existing infrastructure and the deficiencies of existing sensors. Moreover, supervised methods suffer from poor generalization performance in new challenging environments that are unobserved during training. 
Neither the traditional approaches nor the current research solutions meet all the requirements needed for a robust and reliable solution in such demanding conditions. In this thesis, we address specific challenges of ego-motion estimation that pertain to multiple sensor modalities, principally exploiting the geometric constraints of the scene and unlabelled data.</p>
<p>Cameras are widely deployed in ego-motion estimation systems due to their high mobility and rich visual acuity. Likewise, lidar sensors have emerged as an essential component of high-performance ego-motion estimation due to their outstanding angular resolution and highly accurate range measurements. Despite the maturity of optical sensing systems, such as camera and lidar, adverse operating conditions such as poor illumination and precipitation dramatically impact performance. Recent advances in radar technology have enabled ultralight single-chip millimetre-wave (mmWave) radar sensors operating in the 76–81 GHz spectrum and their deployment in ubiquitous solutions. A key advantage of mmWave radar over visible spectrum sensors is its immunity to adverse conditions, e.g., it is agnostic to scene illumination and airborne obscurants. However, radar has intrinsically lower spatial resolution than lidar due to its longer signal wavelength and wide beamwidth. In recent years, mmWave imaging radars have emerged, enabling the measured point clouds to reach a resolution and density comparable to those of a low-grade lidar. Although radars can provide an alternative and complementary solution for odometry tasks, radar measurements are still significantly coarser and noisier than those of lidar and cameras, requiring new approaches to achieve good performance. On the other hand, the fusion of multiple sensors has the potential to improve the overall ego-motion estimation performance. To utilize the diversity offered by multi-modal sensing, ego-motion estimation algorithms must deal with spatially, geometrically and temporally unaligned data streams. To address such challenges of ego-motion estimation, we propose several solutions based on different modalities.</p>
<p>As a first major contribution, we propose a generative self-supervised learning framework that predicts 6-DoF camera motion from unlabelled RGB image sequences, using deep convolutional Generative Adversarial Networks (GANs) and exploiting the predicted monocular depth map of the scene. We eliminate the need for a vast amount of labelled data for a data-centric visual odometry approach by exploiting the geometric consistency between the intermediate predictions and the scene. As part of this contribution, we also propose a novel monocular visual odometry estimation method (DPVO) that can operate in challenging environments and recover the depth map of the scene, providing persistent results over a long duration. Our contributions in the loss functions and the depth enhancement enable operation over long time periods in perceptually degraded environments.</p>
<p>Secondly, we introduce SelfVIO, a novel self-supervised deep learning-based visual-inertial odometry (VIO) method using adversarial training and self-adaptive visual-inertial sensor fusion, exploiting the recovered depth map as part of the geometric consistency signal. SelfVIO increases the robustness of the proposed self-supervised VO approach, addressing numerous challenges such as scale ambiguity, the need for hand-crafted mathematical features (e.g., ORB, BRISK), strict parameter tuning and image blur caused by abrupt camera motion.</p>
<p>Thirdly, we introduce Milli-RIO, an mmWave radar-based odometry solution making use of a single-chip low-cost radar and inertial measurement unit sensor to estimate the 6-DoF ego-motion of a moving radar in indoor environments. Milli-RIO fuses the radar measurements with IMU readings to reduce the drift in the predicted trajectory, addressing the deficiencies of the single-chip mmWave radar.</p>
<p>Finally, we propose a geometry-aware ego-motion learning method that is robust to inclement weather conditions. The proposed approach is a deep learning-based self-supervised method that fuses the rich representation capability of visual sensors with the weather-immune features provided by radar using an attention-based geometry-aware learning technique. Our method predicts reliability masks of the multi-modal measurements and incorporates them without any need for labelled data. In various experiments, we show the cross-domain generalization performance of our approach under harsh weather conditions such as rain, fog, and snow, as well as day and night conditions. Furthermore, we employ a game-theoretic approach to analyse the interpretability of the model predictions, illustrating the independent and uncorrelated failure modes of the multi-modal system. We further show that our method is generalisable to different sensor configurations and diverse datasets.</p>
|