A Lightweight Dual-Branch Swin Transformer for Remote Sensing Scene Classification

Bibliographic Details
Main Authors: Fujian Zheng, Shuai Lin, Wei Zhou, Hong Huang
Format: Article
Language: English
Published: MDPI AG 2023-05-01
Series: Remote Sensing
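Subjects: remote sensing scene classification; convolutional neural networks (CNNs); transfer learning; vision transformer (ViT)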
Online Access: https://www.mdpi.com/2072-4292/15/11/2865
author Fujian Zheng
Shuai Lin
Wei Zhou
Hong Huang
collection DOAJ
description The main challenge of scene classification is to understand the semantic context of high-resolution remote sensing images. Although vision transformer (ViT)-based methods have been explored to capture the long-range dependencies in such images, the connectivity between neighboring windows is still limited. Meanwhile, ViT-based methods commonly contain a large number of parameters, resulting in a high computational cost. In this paper, a novel lightweight dual-branch swin transformer (LDBST) method for remote sensing scene classification is proposed, in which the discriminative ability of scene features is increased by combining a ViT branch and a convolutional neural network (CNN) branch. First, based on the hierarchical swin transformer model, LDBST divides the input features of each stage into two parts, which are then fed separately into the two branches. In the ViT branch, a dual multilayer perceptron structure with a depthwise convolutional layer, termed Conv-MLP, is integrated to strengthen the connections between neighboring windows. In the CNN branch, a simple structure with max pooling preserves the strong features of the scene feature map. The CNN branch also lightens LDBST by avoiding complex multi-head attention and multilayer perceptron computations. To obtain a better feature representation, LDBST was pretrained on the large-scale remote sensing scene classification images of the MLRSN and RSD46-WHU datasets. The two resulting sets of pretrained weights were fine-tuned on target scene classification datasets. The experimental results showed that the proposed LDBST method was more effective than some other advanced remote sensing scene classification methods.
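The per-stage channel split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the half-and-half split ratio, the 3x3 kernel sizes, the merge by channel concatenation, and all function names are assumptions made for illustration. The ViT branch is reduced to the depthwise convolution used inside Conv-MLP, and window attention and the MLP layers are left out entirely.

```python
# Illustrative sketch only: the split ratio, kernel sizes, and the merge by
# concatenation are assumptions, not details taken from the abstract.

def pad2d(plane, fill):
    """Surround an H x W plane with a one-pixel border of `fill`."""
    width = len(plane[0]) + 2
    border = [[fill] * width]
    return border + [[fill] + list(row) + [fill] for row in plane] + [[fill] * width]

def depthwise_conv3x3(channels, kernels):
    """Apply one 3x3 kernel per channel, with no mixing across channels."""
    out = []
    for plane, k in zip(channels, kernels):
        p = pad2d(plane, 0.0)
        h, w = len(plane), len(plane[0])
        out.append([[sum(p[i + di][j + dj] * k[di][dj]
                         for di in range(3) for dj in range(3))
                     for j in range(w)] for i in range(h)])
    return out

def max_pool3x3(channels):
    """3x3 max pooling with stride 1; padding keeps the spatial size."""
    out = []
    for plane in channels:
        p = pad2d(plane, float("-inf"))
        h, w = len(plane), len(plane[0])
        out.append([[max(p[i + di][j + dj]
                         for di in range(3) for dj in range(3))
                     for j in range(w)] for i in range(h)])
    return out

def dual_branch_stage(features, kernels):
    """Split channels in two, process each half in its own branch, then merge."""
    half = len(features) // 2
    vit_part, cnn_part = features[:half], features[half:]
    # Stand-in for the ViT branch: only the depthwise convolution of Conv-MLP.
    vit_out = depthwise_conv3x3(vit_part, kernels)
    # CNN branch: max pooling keeps the strongest local responses.
    cnn_out = max_pool3x3(cnn_part)
    return vit_out + cnn_out  # assumed merge: channel concatenation

# Two channels of size 2 x 2; an identity kernel leaves the first half unchanged.
features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 1.0], [0.0, 2.0]]]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
out = dual_branch_stage(features, [identity])
```

In this sketch the pooled half involves no learned weights and no attention computation, which is one way to read the abstract's claim that the CNN branch lightens the model.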
first_indexed 2024-03-11T02:57:56Z
format Article
id doaj.art-ac89f48275c847c3a73f1a4d43403384
institution Directory Open Access Journal
issn 2072-4292
language English
last_indexed 2024-03-11T02:57:56Z
publishDate 2023-05-01
publisher MDPI AG
record_format Article
series Remote Sensing
spelling doaj.art-ac89f48275c847c3a73f1a4d43403384
2023-11-18T08:29:46Z
eng
MDPI AG
Remote Sensing
2072-4292
2023-05-01
vol. 15, no. 11, art. 2865
doi:10.3390/rs15112865
A Lightweight Dual-Branch Swin Transformer for Remote Sensing Scene Classification
Fujian Zheng (0), Shuai Lin (1), Wei Zhou (2), Hong Huang (3)
(0) Key Laboratory of Optoelectronic Technology and Systems of the Education Ministry of China, Chongqing University, Chongqing 400044, China
(1) Shandong Non-Metallic Materials Institute, Linyi 250031, China
(2) School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing 401331, China
(3) Key Laboratory of Optoelectronic Technology and Systems of the Education Ministry of China, Chongqing University, Chongqing 400044, China
title A Lightweight Dual-Branch Swin Transformer for Remote Sensing Scene Classification
topic remote sensing scene classification
convolutional neural networks (CNNs)
transfer learning
vision transformer (ViT)
url https://www.mdpi.com/2072-4292/15/11/2865