Swin Transformer Embedding Dual-Stream for Semantic Segmentation of Remote Sensing Imagery


Bibliographic Details
Main Authors: Xuanyu Zhou, Lifan Zhou, Shengrong Gong, Shan Zhong, Wei Yan, Yizhou Huang
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Online Access: https://ieeexplore.ieee.org/document/10294282/
Description
Summary: The acquisition of global context and boundary information is crucial for the semantic segmentation of remote sensing (RS) images. In contrast to convolutional neural networks (CNNs), transformers exhibit superior performance in global modeling and shape feature encoding, which provides a novel avenue for obtaining global context and boundary information. However, current methods fail to effectively leverage these distinctive advantages of transformers. To address this issue, we propose STDSNet, a novel architecture with a single encoder and dual decoders that embeds the Swin transformer into a dual-stream network for semantic segmentation of RS imagery. The proposed STDSNet employs the Swin transformer as the network backbone in the encoder to address the limitations of CNNs in global modeling and in encoding shape features. The dual-stream decoder comprises two parallel streams: the global stream (GS) and the shape stream (SS). The GS utilizes the global context fusion module (GCFM) to address the loss of global context during upsampling, and further integrates GCFMs with skip connections and a multiscale fusion strategy to mitigate large-scale regional classification errors caused by similar features or shadow occlusion in RS images. The SS introduces the gate convolution module (GCM) to filter out irrelevant features, allowing it to focus on processing boundary information, which improves the semantic segmentation of small targets and their boundaries in RS images. Extensive experiments demonstrate that STDSNet outperforms other state-of-the-art methods on the ISPRS Vaihingen and Potsdam benchmarks.
ISSN:2151-1535
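The record's summary describes the shape stream's gate convolution module (GCM) as filtering out features irrelevant to boundaries. The paper's exact formulation is not given here, but gated filtering of this kind commonly follows a simple pattern: one stream's feature map is projected to a sigmoid mask in [0, 1] that modulates the other stream element-wise. The sketch below illustrates that generic pattern in NumPy; the function name `gated_fuse`, the 1x1 projection, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    # Numerically plain logistic function; maps any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(shape_feat, global_feat, w, b):
    """Hypothetical gated filtering step (not the paper's exact GCM).

    A 1x1 channel projection of the global-stream feature produces a
    single-channel mask in (0, 1); multiplying the shape-stream feature
    by that mask suppresses activations judged irrelevant to boundaries.
    """
    # (H, W, C) @ (C, 1) -> (H, W, 1): a 1x1 "convolution" over channels.
    gate = sigmoid(global_feat @ w + b)
    # Broadcasts (H, W, 1) over (H, W, C): element-wise gating.
    return shape_feat * gate

# Toy example with random features standing in for decoder activations.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 16
shape_feat = rng.standard_normal((H, W, C))
global_feat = rng.standard_normal((H, W, C))
w = rng.standard_normal((C, 1))
b = 0.0

out = gated_fuse(shape_feat, global_feat, w, b)
print(out.shape)  # (8, 8, 16)
```

Because the gate is strictly between 0 and 1, the output can only attenuate the shape-stream feature, never amplify it, which is what makes this a filtering (rather than additive) fusion.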