RGB-D Scene Recognition via Spatial-Related Multi-Modal Feature Learning

RGB-D image-based scene recognition has achieved significant performance improvements with the development of deep learning methods. While convolutional neural networks can learn high-level semantic features for object recognition, these methods still have limitations for RGB-D scene classification....


Bibliographic Details
Main Authors:	Zhitong Xiong, Yuan Yuan, Qi Wang
Format:	Article
Language:	English
Published:	IEEE 2019-01-01
Series:	IEEE Access
Subjects:	RGB-D; scene recognition; global and local features; multi-modal feature learning
Online Access:	https://ieeexplore.ieee.org/document/8782114/

_version_	1819181972016070656
author	Zhitong Xiong; Yuan Yuan; Qi Wang
author_facet	Zhitong Xiong; Yuan Yuan; Qi Wang
author_sort	Zhitong Xiong
collection	DOAJ
description	RGB-D image-based scene recognition has achieved significant performance improvements with the development of deep learning methods. While convolutional neural networks can learn high-level semantic features for object recognition, these methods still have limitations for RGB-D scene classification. One limitation is that learning better multi-modal features for RGB-D scene recognition remains an open problem. Another is that scene images are usually not object-centric and exhibit great spatial variability. Thus, vanilla full-image CNN features may not be optimal for scene recognition. Considering these problems, in this paper, we propose a compact and effective framework for RGB-D scene recognition. Specifically, we make the following contributions: 1) a novel RGB-D scene recognition framework is proposed to explicitly learn the global modal-specific and local modal-consistent features simultaneously. Different from existing approaches, local CNN features are considered for the learning of modal-consistent representations; 2) a Key Feature Selection (KFS) module is designed, which can adaptively select important local features from the high-level semantic CNN feature maps. It is more efficient and effective than object detection and dense patch-sampling based methods; and 3) a triplet correlation loss and a spatial-attention similarity loss are proposed for the training of the KFS module. Under the supervision of the proposed loss functions, the network can learn important local features of the two modalities with no need for extra annotations. Finally, by concatenating the global and local features, the proposed framework achieves new state-of-the-art scene recognition performance on the SUN RGB-D dataset and the NYU Depth version 2 (NYUD v2) dataset.
first_indexed	2024-12-22T22:38:43Z
format	Article
id	doaj.art-e012691d29fd4667b6d05d3cf26b540a
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-12-22T22:38:43Z
publishDate	2019-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-e012691d29fd4667b6d05d3cf26b540a | 2022-12-21T18:10:14Z | eng | IEEE | IEEE Access | 2169-3536 | 2019-01-01 | 7 | 106739 | 106747 | 10.1109/ACCESS.2019.2932080 | 8782114 | RGB-D Scene Recognition via Spatial-Related Multi-Modal Feature Learning | Zhitong Xiong (0) | Yuan Yuan (1) | Qi Wang (2) | https://orcid.org/0000-0002-7028-4956 | School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an, China | School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an, China | School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an, China | https://ieeexplore.ieee.org/document/8782114/ | RGB-D | scene recognition | global and local features | multi-modal feature learning
title	RGB-D Scene Recognition via Spatial-Related Multi-Modal Feature Learning
topic	RGB-D; scene recognition; global and local features; multi-modal feature learning
url	https://ieeexplore.ieee.org/document/8782114/

RGB-D Scene Recognition via Spatial-Related Multi-Modal Feature Learning

Similar Items