Swin Transformer Embedding Dual-Stream for Semantic Segmentation of Remote Sensing Imagery

Bibliographic Details
Main Authors: Xuanyu Zhou, Lifan Zhou, Shengrong Gong, Shan Zhong, Wei Yan, Yizhou Huang
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Subjects: Dual-stream; remote sensing (RS); semantic segmentation; Swin transformer
Online Access: https://ieeexplore.ieee.org/document/10294282/
ISSN: 2151-1535
Volume: 17
Pages: 175-189
DOI: 10.1109/JSTARS.2023.3326967
Affiliations:
Xuanyu Zhou (ORCID 0009-0002-0633-694X): School of Computer and Information Technology, Northeast Petroleum University, Daqing, China
Lifan Zhou (ORCID 0000-0001-7665-413X): School of Computer Science and Engineering, Changshu Institute of Technology, Suzhou, China
Shengrong Gong (ORCID 0000-0003-0266-2422): School of Computer Science and Engineering, Changshu Institute of Technology, Suzhou, China
Shan Zhong (ORCID 0000-0003-0034-6952): School of Computer Science and Engineering, Changshu Institute of Technology, Suzhou, China
Wei Yan (ORCID 0000-0002-3412-4015): School of Computer Science and Engineering, Changshu Institute of Technology, Suzhou, China
Yizhou Huang (ORCID 0009-0000-1584-233X): School of Electrical and Automation Engineering, Changshu Institute of Technology, Suzhou, China

Abstract

The acquisition of global context and boundary information is crucial for the semantic segmentation of remote sensing (RS) images. Compared with convolutional neural networks (CNNs), transformers perform better at global modeling and shape-feature encoding, which opens a new avenue for capturing global context and boundary information. However, current methods do not effectively exploit these advantages of transformers. To address this issue, we propose STDSNet, a novel single-encoder, dual-decoder architecture that embeds the Swin transformer into a dual-stream network for semantic segmentation of RS imagery. STDSNet uses the Swin transformer as the encoder backbone to overcome the limitations of CNNs in global modeling and shape-feature encoding.
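The abstract states only that the STDSNet encoder uses the Swin transformer as its backbone; it does not give the variant, window size, or input resolution. As a rough, non-authoritative illustration of what such a backbone does, the plain-Python sketch below shows three standard Swin mechanics: partitioning a token grid into non-overlapping M x M windows (self-attention is computed inside each window), the cyclic shift applied before shifted-window blocks so that neighbouring windows exchange information, and the feature-map size at each of the four stages. The helper names, the window size of 2, the 512 x 512 input, and the Swin-T-like embedding dimension of 96 are assumptions chosen for illustration; the attention computation and the shifted-window attention mask are omitted.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def cyclic_shift(grid: List[List[T]], shift: int) -> List[List[T]]:
    """Roll the grid up and to the left by `shift` cells, with wrap-around.

    This is the roll Swin applies before its shifted-window blocks. The attention
    mask that stops wrapped-around tokens from attending to each other is omitted.
    """
    h, w = len(grid), len(grid[0])
    return [[grid[(i + shift) % h][(j + shift) % w] for j in range(w)] for i in range(h)]


def window_partition(grid: List[List[T]], m: int) -> List[List[T]]:
    """Split an H x W grid into non-overlapping m x m windows, each flattened row-major.

    In a Swin block, self-attention is computed only among the m * m tokens of a window.
    """
    h, w = len(grid), len(grid[0])
    if h % m or w % m:
        raise ValueError("grid height and width must be divisible by the window size")
    return [
        [grid[i][j] for i in range(top, top + m) for j in range(left, left + m)]
        for top in range(0, h, m)
        for left in range(0, w, m)
    ]


def stage_shapes(
    height: int, width: int, embed_dim: int = 96, patch: int = 4, stages: int = 4
) -> List[Tuple[int, int, int]]:
    """(H, W, C) of the feature map produced by each encoder stage.

    Patch embedding divides the input size by `patch`; each later stage begins with
    patch merging, which halves H and W and doubles the channel count C.
    """
    h, w, c = height // patch, width // patch, embed_dim
    shapes = []
    for _ in range(stages):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2
    return shapes


grid = [[4 * i + j for j in range(4)] for i in range(4)]  # token ids 0..15 on a 4 x 4 grid
print(window_partition(grid, 2))  # [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
# After a shift of m // 2 = 1, the first window mixes tokens from all four original windows.
print(window_partition(cyclic_shift(grid, 1), 2)[0])  # [5, 6, 9, 10]
print(stage_shapes(512, 512))  # [(128, 128, 96), (64, 64, 192), (32, 32, 384), (16, 16, 768)]
```

The multi-resolution outputs of `stage_shapes` are the kind of hierarchical features a decoder can draw on through skip connections; how STDSNet actually taps them is not specified in this record.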
The dual decoder comprises two parallel streams: the global stream (GS) and the shape stream (SS). The GS uses a global context fusion module (GCFM) to counter the loss of global context during upsampling, and it combines GCFMs with skip connections and a multiscale fusion strategy to reduce classification errors on large regional objects caused by similar features or shadow occlusion in RS images. The SS introduces a gate convolution module (GCM) that filters out irrelevant features so that the stream can concentrate on boundary information, which improves the segmentation of small targets and their boundaries in RS images. Extensive experiments show that STDSNet outperforms other state-of-the-art methods on the ISPRS Vaihingen and Potsdam benchmarks.
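The abstract describes the shape stream's gate convolution module (GCM) only by its purpose: filtering out irrelevant features so the stream can focus on boundaries. Its internals are not given here, so the sketch below follows the common gated-shape-stream pattern (popularized by Gated-SCNN) rather than STDSNet's published module: a 1x1 convolution over the concatenated shape and global features, a sigmoid that turns the result into a per-pixel gate in (0, 1), and element-wise multiplication of the shape features by that gate. The function name `gated_conv_1x1`, the single-channel simplification, and the fixed weights are hypothetical.

```python
import math
from typing import List, Tuple

# One channel of an H x W feature map (a real decoder would use many channels).
FeatureMap = List[List[float]]


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_conv_1x1(
    shape_feat: FeatureMap,
    global_feat: FeatureMap,
    w_shape: float,
    w_global: float,
    bias: float,
) -> Tuple[FeatureMap, FeatureMap]:
    """Hypothetical shape-stream gate (an assumption, not STDSNet's actual GCM).

    With one input channel per stream, a 1x1 convolution over the concatenated
    [shape, global] channels reduces, per pixel, to w_shape * s + w_global * g + bias.
    Its sigmoid is the gate; multiplying the shape features by the gate keeps pixels
    whose gate is near 1 and suppresses pixels whose gate is near 0.
    """
    if len(shape_feat) != len(global_feat) or any(
        len(s_row) != len(g_row) for s_row, g_row in zip(shape_feat, global_feat)
    ):
        raise ValueError("feature maps must have the same spatial size")
    gate = [
        [sigmoid(w_shape * s + w_global * g + bias) for s, g in zip(s_row, g_row)]
        for s_row, g_row in zip(shape_feat, global_feat)
    ]
    gated = [
        [s * a for s, a in zip(s_row, a_row)]
        for s_row, a_row in zip(shape_feat, gate)
    ]
    return gated, gate


# Example: the global features mark column 0 as relevant and column 1 as irrelevant.
shape_map = [[1.0, 1.0], [2.0, 2.0]]
global_map = [[6.0, -6.0], [6.0, -6.0]]
gated, gate = gated_conv_1x1(shape_map, global_map, w_shape=0.0, w_global=1.0, bias=0.0)
```

In a trained network the weights would be learned and the maps would have many channels; the sketch only shows the mechanism by which a gate near 0 removes a pixel's shape features from further processing, which is one way such a module can discard features unrelated to boundaries.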