CMANet: Cross-Modality Attention Network for Indoor-Scene Semantic Segmentation

Indoor-scene semantic segmentation is of great significance to indoor navigation, high-precision map creation, route planning, etc. However, combining RGB and HHA images (HHA encodes depth as horizontal disparity, height above ground, and angle with the gravity direction) for indoor-scene semantic segmentation is a promising yet challenging task, owing to the diversity of textures and structures and the disparity between the two modalities in physical meaning. In this paper, we propose a Cross-Modality Attention Network (CMANet) that facilitates the extraction of both RGB and HHA features and enhances cross-modality feature integration. CMANet is built on an encoder–decoder architecture. The encoder consists of two parallel branches that successively extract latent modality features from the RGB and HHA images, respectively. In particular, a novel self-attention-based Cross-Modality Refine Gate (CMRG) is presented to bridge the two branches; it performs cross-modality feature fusion, produces refined aggregated features, and serves as the most crucial component of CMANet. The decoder is a multi-stage up-sampling backbone composed of different residual blocks at each up-sampling stage. Furthermore, bi-directional multi-step propagation and pyramid supervision are applied to assist the learning process. To evaluate the effectiveness and efficiency of the proposed method, extensive experiments were conducted on the NYUDv2 and SUN RGB-D datasets. Experimental results demonstrate that our method outperforms existing methods on indoor semantic-segmentation tasks.
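The abstract describes the CMRG only at a high level. As a concrete, purely illustrative picture of what attention-gated RGB/HHA fusion can look like, here is a minimal PyTorch sketch; the class name CMRGSketch, the squeeze-and-excitation-style gate, and the channel sizes are assumptions made for this example and are not the authors' published formulation:

    # Hypothetical sketch of a CMRG-style gated fusion block. The real CMRG
    # may differ substantially; this only illustrates attention-gated fusion
    # of same-resolution RGB and HHA feature maps.
    import torch
    import torch.nn as nn

    class CMRGSketch(nn.Module):
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            # Channel-attention gate computed from the concatenated
            # modalities (squeeze-and-excitation flavor; an assumption).
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),
                nn.Sigmoid(),
            )
            # 1x1 convolution that refines the gated mixture.
            self.refine = nn.Conv2d(channels, channels, kernel_size=1)

        def forward(self, rgb: torch.Tensor, hha: torch.Tensor) -> torch.Tensor:
            # g in (0, 1) decides, per channel, how much to trust each modality.
            g = self.gate(torch.cat([rgb, hha], dim=1))
            fused = g * rgb + (1.0 - g) * hha
            return self.refine(fused)

    # Example: fuse encoder-stage features of shape (N, 256, 30, 40).
    block = CMRGSketch(channels=256)
    out = block(torch.randn(2, 256, 30, 40), torch.randn(2, 256, 30, 40))
    print(out.shape)  # torch.Size([2, 256, 30, 40])

In a full network of this kind, one such gate would typically sit at each encoder stage, with the fused features passed on to the decoder; the paper at the link below gives the actual CMRG design.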

Bibliographic Details
Main Authors: Longze Zhu, Zhizhong Kang, Mei Zhou, Xi Yang, Zhen Wang, Zhen Cao, Chenming Ye
Format: Article
Language: English
Published: MDPI AG, 2022-11-01
Series: Sensors
Subjects: semantic segmentation; indoor scene; HHA data; cross-modality aggregation; attention mechanism
Online Access: https://www.mdpi.com/1424-8220/22/21/8520
DOI: 10.3390/s22218520

Author Affiliations:
Longze Zhu, Zhizhong Kang, Zhen Wang, Zhen Cao, Chenming Ye: School of Land Science and Technology, China University of Geosciences, Beijing 100083, China
Mei Zhou: Key Laboratory of Quantitative Remote Sensing Information Technology, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
Xi Yang: College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China

Citation: Sensors, vol. 22, no. 21, article 8520, 2022-11-01.