Exploring Self-Supervised Learning for Multi-Modal Remote Sensing Pre-Training via Asymmetric Attention Fusion
Self-supervised learning (SSL) has significantly bridged the gap between supervised and unsupervised learning in computer vision tasks and has shown impressive success in the field of remote sensing (RS). However, existing methods have primarily focused on single-modal RS data, which may limit their ability to capture the diversity of information in complex scenes. In this paper, we propose the Asymmetric Attention Fusion (AAF) framework to explore the potential of multi-modal representation learning, comparing it against two simpler fusion strategies: early fusion and late fusion. Because data from active sensors such as light detection and ranging (LiDAR), and products derived from them such as digital surface models (DSMs), are often noisier and less informative than optical images, AAF applies an asymmetric attention mechanism within a two-stream encoder at each encoder stage. In addition, we introduce a Transfer Gate module that selects the more informative features from the fused representations, enhancing performance on downstream tasks. Comparative analyses of scene classification and segmentation on the ISPRS Potsdam dataset demonstrate significant performance gains with AAF over baseline methods: the proposed approach improves all metrics by more than 7% relative to random initialization on both tasks, and it consistently outperforms both early fusion and late fusion. These results underscore the effectiveness of AAF in leveraging the strengths of multi-modal RS data for SSL, opening the door to more sophisticated and nuanced RS analysis.
Main Authors: Guozheng Xu; Xue Jiang; Xiangtai Li; Ze Zhang; Xingzhao Liu
Format: Article
Language: English
Published: MDPI AG, 2023-12-01
Series: Remote Sensing
Subjects: scene segmentation and classification; remote sensing data; multi-modal; asymmetric attention fusion
Online Access: https://www.mdpi.com/2072-4292/15/24/5682
author | Guozheng Xu; Xue Jiang; Xiangtai Li; Ze Zhang; Xingzhao Liu
collection | DOAJ |
description | Self-supervised learning (SSL) has significantly bridged the gap between supervised and unsupervised learning in computer vision tasks and has shown impressive success in the field of remote sensing (RS). However, existing methods have primarily focused on single-modal RS data, which may limit their ability to capture the diversity of information in complex scenes. In this paper, we propose the Asymmetric Attention Fusion (AAF) framework to explore the potential of multi-modal representation learning, comparing it against two simpler fusion strategies: early fusion and late fusion. Because data from active sensors such as light detection and ranging (LiDAR), and products derived from them such as digital surface models (DSMs), are often noisier and less informative than optical images, AAF applies an asymmetric attention mechanism within a two-stream encoder at each encoder stage. In addition, we introduce a Transfer Gate module that selects the more informative features from the fused representations, enhancing performance on downstream tasks. Comparative analyses of scene classification and segmentation on the ISPRS Potsdam dataset demonstrate significant performance gains with AAF over baseline methods: the proposed approach improves all metrics by more than 7% relative to random initialization on both tasks, and it consistently outperforms both early fusion and late fusion. These results underscore the effectiveness of AAF in leveraging the strengths of multi-modal RS data for SSL, opening the door to more sophisticated and nuanced RS analysis.
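The description sketches the AAF design at a high level: a two-stream encoder in which the optical stream is fused with the noisier active-sensor stream through asymmetric attention at every stage, followed by a Transfer Gate that keeps only the more informative fused features. The paper's actual implementation is not part of this record, so the following PyTorch sketch is only one plausible reading of that description; every module and parameter name (AsymmetricAttentionFusion, TransferGate, embed_dim, and so on) is an illustrative assumption rather than the authors' code.

```python
# Hypothetical sketch of the asymmetric fusion described in the abstract.
# Assumptions (not from the paper): stage features are flattened to token
# sequences, the optical stream forms the attention queries, and the noisier
# DSM/LiDAR stream supplies only keys/values, so it can inform but never
# overwrite the optical representation.
import torch
import torch.nn as nn


class AsymmetricAttentionFusion(nn.Module):
    """Cross-attention in which only the optical stream forms the queries."""

    def __init__(self, embed_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, optical: torch.Tensor, active: torch.Tensor) -> torch.Tensor:
        # optical, active: (batch, tokens, embed_dim) features from one encoder stage.
        fused, _ = self.attn(query=optical, key=active, value=active)
        # Residual connection keeps the optical signal dominant.
        return self.norm(optical + fused)


class TransferGate(nn.Module):
    """Sigmoid gate deciding, per channel, how much fused feature to pass on."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.Sigmoid())

    def forward(self, optical: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([optical, fused], dim=-1))
        return g * fused + (1.0 - g) * optical


if __name__ == "__main__":
    B, N, D = 2, 196, 256          # batch, tokens (e.g., 14x14 patches), channels
    opt = torch.randn(B, N, D)     # optical-stream features
    act = torch.randn(B, N, D)     # DSM/LiDAR-stream features
    fused = AsymmetricAttentionFusion(D)(opt, act)
    out = TransferGate(D)(opt, fused)
    print(out.shape)               # torch.Size([2, 196, 256])
```

The residual connection and the gate reflect the stated motivation: the DSM/LiDAR stream can contribute context at every stage, while the gate can suppress it wherever it is noisy or uninformative.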
format | Article |
id | doaj.art-682477e5f7aa4969a96ed4ce5932f66a |
institution | Directory Open Access Journal |
issn | 2072-4292 |
language | English |
publishDate | 2023-12-01 |
publisher | MDPI AG |
record_format | Article |
series | Remote Sensing |
doi | 10.3390/rs15245682
citation | Remote Sensing, vol. 15, no. 24, article 5682, MDPI AG, 2023-12-01
affiliations | Guozheng Xu, Xue Jiang, Ze Zhang, and Xingzhao Liu: The School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China. Xiangtai Li: The S-Lab, Nanyang Technological University, Singapore 639798, Singapore.
title | Exploring Self-Supervised Learning for Multi-Modal Remote Sensing Pre-Training via Asymmetric Attention Fusion |
topic | scene segmentation and classification; remote sensing data; multi-modal; asymmetric attention fusion
url | https://www.mdpi.com/2072-4292/15/24/5682 |