Summary: | <p>Robots need to interact with the world to complete practical tasks. This typically requires interacting with objects, making object perception a fundamental capability in robotics. Real-world robotics applications pose particular challenges, and offer particular opportunities, for object perception. It is possible to utilise data from a variety of sensing modalities, such as cameras and lidar devices, but physical hardware constraints demand careful management of computational resources. Similarly, the need to respond to changing environmental conditions and task definitions requires robots to learn efficiently from limited human supervision. Efficiency in terms of both computation and supervision is thus essential. To these ends, this thesis makes several contributions. Firstly, we develop Vote3Deep which, to the best of our knowledge, is the first method to perform efficient, native 3D object detection in real-world lidar point clouds with convolutional neural networks (CNNs). This is achieved by maintaining the natural sparsity of the data throughout the CNN hierarchy and by leveraging an efficient, sparse convolution operation. Secondly, we present GENESIS which, to the best of our knowledge, is the first object-centric generative model that learns to decompose rendered images of simulated 3D environments into object-like components without supervision, while also being able to generate entire coherent scenes in an object-centric fashion. This is facilitated by an autoregressive prior that captures correlations between objects in the generative model, offering the potential of utilising GENESIS as an object-centric "world model" for efficient skill acquisition. Finally, since we want to apply object-centric generative models in real-world applications, we develop GENESIS++, which utilises a novel algorithm for fully-differentiable clustering of instance embeddings to facilitate the symmetric inference of object representations. We show that GENESIS++ outperforms competitive baselines both on well-established simulated datasets and on challenging real-world datasets collected in the context of robotics applications.</p>
|