A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing

In recent years, there has been a growing interest in remote sensing image–text cross-modal retrieval due to the rapid development of space information technology and the significant increase in the volume of remote sensing image data. Remote sensing images have unique characteristics that make the...

Bibliographic Details
Main Authors:	Xiong Zhang, Weipeng Li, Xu Wang, Luyao Wang, Fuzhong Zheng, Long Wang, Haisu Zhang
Format:	Article
Language:	English
Published:	MDPI AG 2023-09-01
Series:	Remote Sensing
Subjects:	cross-modal retrieval remote sensing images fusion encoding method joint representation contrastive learning
Online Access:	https://www.mdpi.com/2072-4292/15/18/4637

author	Xiong Zhang Weipeng Li Xu Wang Luyao Wang Fuzhong Zheng Long Wang Haisu Zhang
collection	DOAJ
description	Interest in remote sensing image–text cross-modal retrieval has grown in recent years, driven by the rapid development of space information technology and the significant increase in the volume of remote sensing image data. Remote sensing images have unique characteristics that make the cross-modal retrieval task challenging. First, the semantics of remote sensing images are fine-grained: an image can be divided into multiple basic units of semantic expression, and different combinations of these units can generate diverse text descriptions. Second, these images vary in resolution, color, and perspective. To address these challenges, this paper proposes a multi-task guided fusion encoder (MTGFE) based on the multimodal fusion encoding method, whose strong performance has been demonstrated in the cross-modal retrieval of natural images. The model is trained jointly on three tasks, image–text matching (ITM), masked language modeling (MLM), and the newly introduced multi-view joint representations contrast (MVJRC), which enhances its capability to capture fine-grained correlations between remote sensing images and texts. Specifically, the MVJRC task is designed to improve the model's consistency in joint representation expression and fine-grained correlation, particularly for remote sensing images with significant differences in resolution, color, and perspective. Furthermore, to address the computational complexity associated with large-scale fusion models, this paper proposes a retrieval filtering method that improves retrieval efficiency while minimizing accuracy loss. Extensive experiments were conducted on four public datasets to evaluate the proposed method, and the results validate its effectiveness.
first_indexed	2024-03-10T22:04:42Z
format	Article
id	doaj.art-b32d599baa5c484994e6fa10d6147734
institution	Directory Open Access Journal
issn	2072-4292
language	English
last_indexed	2024-03-10T22:04:42Z
publishDate	2023-09-01
publisher	MDPI AG
record_format	Article
series	Remote Sensing
doi	10.3390/rs15184637
affiliation	School of Information and Communication, National University of Defense Technology, Wuhan 430074, China (all authors)
title	A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing
topic	cross-modal retrieval remote sensing images fusion encoding method joint representation contrastive learning
url	https://www.mdpi.com/2072-4292/15/18/4637

Similar Items