Overall Understanding of Indoor Scenes by Fusing Multiframe Local RGB-D Data Based on Conditional Random Fields

Bibliographic Details
Main Authors: Haotian Chen, Longfei Su, Biao Zhang, Fengchi Sun, Jing Yuan, Jie Liu
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/9055402/
Description
Summary: Indoor mobile robots normally cannot capture complete information about a scene from a single frame of perceptual data because of the limited sensor scope. With incomplete scene information, a robot may misjudge the category of the current scene, which leads to operational errors. To address this problem, we propose an approach that leverages conditional random fields (CRFs) to fuse multiframe RGB and depth (RGB-D) visual data corresponding to the same scene. This method takes full advantage of the prior knowledge that object categories are strongly related to scene attributes. As each new image arrives, we incrementally integrate the current object detection results to update the scene understanding by identifying duplicate objects between images, ranking the available objects by their relevance to the scene, and fusing the new information with the existing CRF. With this approach, scene classification achieves higher precision on multiview images than on single image frames sampled in the same places. Additionally, a configuration map of the scene is incrementally built within the above framework. The map includes the identities of the recognized objects and various relations between them. This kind of map would not only benefit common robotic tasks but also offer a novel channel for human-robot interaction. We test the efficiency of our method on image sequences extracted from the NYU v2 dataset. The results show that our approach achieves the best performance against state-of-the-art baselines.
ISSN:2169-3536
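
Illustrative sketch: The incremental loop described in the summary (identify duplicate objects, rank objects by relevance to the scene, fuse with the existing model) can be pictured roughly as follows. This is not the authors' CRF and is not taken from the article. It is a simplified log-linear stand-in that scores each scene label from object-scene compatibility weights. The class name `SceneFuser`, the `COMPAT` weights, the distance-based duplicate test, and the `merge_radius` and `top_k` parameters are all hypothetical choices made for this example.

```python
import math

# Hypothetical object-scene compatibility weights (illustrative values only,
# not taken from the article or from the NYU v2 dataset).
COMPAT = {
    "kitchen": {"sink": 1.5, "stove": 2.0, "bed": -2.0, "chair": 0.3},
    "bedroom": {"sink": -1.0, "stove": -2.0, "bed": 2.5, "chair": 0.2},
    "bathroom": {"sink": 1.5, "stove": -2.0, "bed": -2.0, "chair": -0.5},
}


class SceneFuser:
    """Accumulates object detections over frames and scores scene labels."""

    def __init__(self, compat, merge_radius=0.5, top_k=3):
        self.compat = compat
        self.merge_radius = merge_radius  # assumed duplicate threshold, in metres
        self.top_k = top_k                # how many ranked objects to score with
        self.objects = []                 # (label, (x, y, z), confidence)

    def _is_duplicate(self, label, pos):
        # Assumption: same label and nearby 3-D position means the same object.
        return any(
            known == label and math.dist(p, pos) < self.merge_radius
            for known, p, _ in self.objects
        )

    def add_frame(self, detections):
        # Fuse one frame: keep only objects not already seen.
        for label, pos, conf in detections:
            if not self._is_duplicate(label, pos):
                self.objects.append((label, pos, conf))

    def _relevance(self, label):
        # Objects whose weights differ most across scenes discriminate best.
        weights = [w.get(label, 0.0) for w in self.compat.values()]
        return max(weights) - min(weights)

    def posterior(self):
        # Rank objects by relevance, keep the top_k, then apply a softmax
        # over the summed, confidence-weighted compatibility scores.
        ranked = sorted(
            self.objects, key=lambda o: self._relevance(o[0]), reverse=True
        )[: self.top_k]
        scores = {
            scene: sum(conf * w.get(label, 0.0) for label, _, conf in ranked)
            for scene, w in self.compat.items()
        }
        peak = max(scores.values())
        exp = {s: math.exp(v - peak) for s, v in scores.items()}
        total = sum(exp.values())
        return {s: e / total for s, e in exp.items()}
```

In this toy setup, a single frame containing only a sink leaves "kitchen" and "bathroom" tied. A second frame that re-detects the same sink and adds a stove resolves the tie in favour of "kitchen", which mirrors the summary's point that multiview data can disambiguate a scene that one frame cannot.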