A Lightweight Dual-Branch Swin Transformer for Remote Sensing Scene Classification
The main challenge of scene classification is to understand the semantic context information of high-resolution remote sensing images. Although vision transformer (ViT)-based methods have been explored to boost the long-range dependencies of high-resolution remote sensing images, the connectivity be...
Main Authors: | Fujian Zheng, Shuai Lin, Wei Zhou, Hong Huang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-05-01 |
Series: | Remote Sensing |
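Subjects: | remote sensing scene classification; convolutional neural networks (CNNs); transfer learning; vision transformer (ViT) |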
Online Access: | https://www.mdpi.com/2072-4292/15/11/2865 |
_version_ | 1827739233126711296 |
---|---|
author | Fujian Zheng; Shuai Lin; Wei Zhou; Hong Huang |
author_facet | Fujian Zheng; Shuai Lin; Wei Zhou; Hong Huang |
author_sort | Fujian Zheng |
collection | DOAJ |
description | The main challenge of scene classification is understanding the semantic context of high-resolution remote sensing images. Although vision transformer (ViT)-based methods have been explored to capture long-range dependencies in such images, the connectivity between neighboring windows remains limited. Moreover, ViT-based methods commonly contain a large number of parameters, resulting in high computational cost. This paper proposes a novel lightweight dual-branch swin transformer (LDBST) method for remote sensing scene classification, which increases the discriminative ability of scene features by combining a ViT branch and a convolutional neural network (CNN) branch. First, based on the hierarchical swin transformer model, LDBST divides the input features of each stage into two parts, which are fed separately into the two branches. In the ViT branch, a dual multilayer perceptron structure with a depthwise convolutional layer, termed Conv-MLP, is integrated to strengthen the connections between neighboring windows. In addition, a simple CNN branch with max pooling preserves the strong features of the scene feature map. The CNN branch also makes LDBST lighter by avoiding complex multi-head attention and multilayer perceptron computations. To obtain a better feature representation, LDBST was pretrained on the large-scale remote sensing scene classification images of the MLRSN and RSD46-WHU datasets, and the two sets of pretrained weights were then fine-tuned on the target scene classification datasets. The experimental results showed that the proposed LDBST method was more effective than some other advanced remote sensing scene classification methods. |
first_indexed | 2024-03-11T02:57:56Z |
format | Article |
id | doaj.art-ac89f48275c847c3a73f1a4d43403384 |
institution | Directory Open Access Journal |
issn | 2072-4292 |
language | English |
last_indexed | 2024-03-11T02:57:56Z |
publishDate | 2023-05-01 |
publisher | MDPI AG |
record_format | Article |
series | Remote Sensing |
spelling | doaj.art-ac89f48275c847c3a73f1a4d43403384; 2023-11-18T08:29:46Z; eng; MDPI AG; Remote Sensing; 2072-4292; 2023-05-01; 15; 11; 2865; 10.3390/rs15112865; A Lightweight Dual-Branch Swin Transformer for Remote Sensing Scene Classification; Fujian Zheng (0); Shuai Lin (1); Wei Zhou (2); Hong Huang (3); (0) Key Laboratory of Optoelectronic Technology and Systems of the Education Ministry of China, Chongqing University, Chongqing 400044, China; (1) Shandong Non-Metallic Materials Institute, Linyi 250031, China; (2) School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing 401331, China; (3) Key Laboratory of Optoelectronic Technology and Systems of the Education Ministry of China, Chongqing University, Chongqing 400044, China; https://www.mdpi.com/2072-4292/15/11/2865; remote sensing scene classification; convolutional neural networks (CNNs); transfer learning; vision transformer (ViT) |
spellingShingle | Fujian Zheng Shuai Lin Wei Zhou Hong Huang A Lightweight Dual-Branch Swin Transformer for Remote Sensing Scene Classification Remote Sensing remote sensing scene classification convolutional neural networks (CNNs) transfer learning vision transformer (ViT) |
title | A Lightweight Dual-Branch Swin Transformer for Remote Sensing Scene Classification |
title_full | A Lightweight Dual-Branch Swin Transformer for Remote Sensing Scene Classification |
title_fullStr | A Lightweight Dual-Branch Swin Transformer for Remote Sensing Scene Classification |
title_full_unstemmed | A Lightweight Dual-Branch Swin Transformer for Remote Sensing Scene Classification |
title_short | A Lightweight Dual-Branch Swin Transformer for Remote Sensing Scene Classification |
title_sort | lightweight dual branch swin transformer for remote sensing scene classification |
topic | remote sensing scene classification; convolutional neural networks (CNNs); transfer learning; vision transformer (ViT) |
url | https://www.mdpi.com/2072-4292/15/11/2865 |
work_keys_str_mv | AT fujianzheng alightweightdualbranchswintransformerforremotesensingsceneclassification AT shuailin alightweightdualbranchswintransformerforremotesensingsceneclassification AT weizhou alightweightdualbranchswintransformerforremotesensingsceneclassification AT honghuang alightweightdualbranchswintransformerforremotesensingsceneclassification AT fujianzheng lightweightdualbranchswintransformerforremotesensingsceneclassification AT shuailin lightweightdualbranchswintransformerforremotesensingsceneclassification AT weizhou lightweightdualbranchswintransformerforremotesensingsceneclassification AT honghuang lightweightdualbranchswintransformerforremotesensingsceneclassification |