Mixed Attention-Based CrossX Network for Satellite Image Classification

Bibliographic Details
Main Authors: Xiaofan Zhang, Yuhui Zheng
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Online Access: https://ieeexplore.ieee.org/document/10227553/
Description
Summary: Remote sensing scene classification plays an important role in the analysis and interpretation of satellite images, yet it remains a challenging task due to the large range of variation in the data, high spatial resolutions, and complex backgrounds. Most methods use convolutional neural networks (CNNs) for classification; however, common CNNs cannot accurately suppress background information while capturing the key local characteristics of satellite images. In this article, we propose CrossX, a hybrid attention-based network for remote sensing scene classification. A new hybrid attention module, consisting of a spatial attention (SA) module and a channel attention (CA) module, is introduced to fully extract the salient features of the target. Specifically, the SA network aggregates features along two spatial directions to better capture the spatial relationships in the scene, while the CA network uses 1-D convolution to extract image features with a focus on capturing dependencies across channels. Distinctive characteristics of different semantic parts can thus be recovered from the original features, compensating for the lack of semantic information in the spatial dimension, and more efficient feature representations are obtained by fusing these features. The proposed method is evaluated on three remote sensing scene datasets: UC Merced, AID, and NWPU-RESISC45. With ResNet34 as the backbone network, it achieves 99.25%, 96.52%, and 96.9% classification accuracies on the respective test sets. The experimental results show that our method outperforms current representative scene classifiers on both AID and NWPU-RESISC45, and its performance on UC Merced is comparable to that of state-of-the-art models. The proposed method focuses on improving the ability of the attention mechanism to extract features and obtain an efficient target feature representation, and it can be applied to computer vision tasks involving feature extraction and remote sensing scene classification.
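
A minimal PyTorch sketch of a mixed attention module of this kind is given below. It assumes an ECA-style channel attention (1-D convolution over the globally pooled channel descriptor) and a coordinate-attention-style spatial attention (pooling along the height and width directions separately), consistent with what the abstract describes; the class names, kernel size, reduction ratio, and sequential fusion order are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    # ECA-style channel attention: a 1-D convolution slides over the
    # globally pooled channel descriptor, capturing local cross-channel
    # dependencies without dimensionality reduction.
    def __init__(self, kernel_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        b, c, _, _ = x.shape
        y = self.pool(x).view(b, 1, c)       # (B, 1, C) channel descriptor
        y = torch.sigmoid(self.conv(y))      # per-channel weights
        return x * y.view(b, c, 1, 1)


class SpatialAttention(nn.Module):
    # Coordinate-attention-style spatial attention: features are pooled
    # along the two spatial directions (H and W) separately, so the
    # resulting weights retain positional information in both directions.
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # keep H, squeeze W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # keep W, squeeze H
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        xh = self.pool_h(x)                          # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)      # (B, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                      # (B, C, H, 1)
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * ah * aw                           # broadcast over H and W


class MixedAttention(nn.Module):
    # Hybrid module: refine channels first, then spatial positions.
    # Sequential composition is one plausible fusion; the paper may
    # combine the two branches differently.
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention()
        self.sa = SpatialAttention(channels)

    def forward(self, x):
        return self.sa(self.ca(x))

In a ResNet34 backbone, a module like this would typically be inserted after each residual stage, e.g. MixedAttention(128) after the stage that outputs 128 channels.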
ISSN: 2151-1535