MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images

Abstract: Remote Sensing Image Captioning (RSIC) plays a crucial role in advancing semantic understanding and has increasingly become a focal point of research. Nevertheless, existing RSIC methods grapple with challenges due to the intricate multi-scale nature and multifaceted backgrounds inherent in...


Bibliographic Details
Main Authors: Haiyan Huang, Zhenfeng Shao, Qimin Cheng, Xiao Huang, Xiaoping Wu, Guoming Li, Li Tan
Format: Article
Language: English
Published: Taylor & Francis Group 2023-12-01
Series: International Journal of Digital Earth
Subjects: Image captioning, deep learning, semantic understanding, visual-text alignment
Online Access: https://www.tandfonline.com/doi/10.1080/17538947.2023.2283482
author Haiyan Huang
Zhenfeng Shao
Qimin Cheng
Xiao Huang
Xiaoping Wu
Guoming Li
Li Tan
collection DOAJ
description Remote Sensing Image Captioning (RSIC) plays a crucial role in advancing semantic understanding and has increasingly become a focal point of research. Nevertheless, existing RSIC methods grapple with challenges due to the intricate multi-scale nature and multifaceted backgrounds inherent in Remote Sensing Images (RSIs). Compounding these challenges are the perceptible information disparities across diverse modalities. In response to these challenges, we propose a novel multi-scale contextual information aggregation image captioning network (MC-Net). This network incorporates an image encoder enhanced with a multi-scale feature extraction module, a feature fusion module, and a finely tuned adaptive decoder equipped with a visual-text alignment module. Notably, MC-Net possesses the capability to extract informative multiscale features, facilitated by the multilayer perceptron and transformer. We also introduce an adaptive gating mechanism during the decoding phase to ensure precise alignment between visual regions and their corresponding text descriptions. Empirical studies conducted on four publicly recognized cross-modal datasets unequivocally demonstrate the superior robustness and efficacy of MC-Net in comparison to contemporaneous RSIC methods.
first_indexed 2024-03-09T14:23:08Z
format Article
id doaj.art-d144ee6e2d634972896f943d9224b377
institution Directory Open Access Journal
issn 1753-8947
1753-8955
language English
last_indexed 2024-03-09T14:23:08Z
publishDate 2023-12-01
publisher Taylor & Francis Group
record_format Article
series International Journal of Digital Earth
spelling doaj.art-d144ee6e2d634972896f943d9224b377
2023-11-28T09:04:27Z
eng
Taylor & Francis Group
International Journal of Digital Earth
1753-8947
1753-8955
2023-12-01
Volume 16, issue 2, pages 4848-4866
10.1080/17538947.2023.2283482
MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images
Haiyan Huang (State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, People's Republic of China)
Zhenfeng Shao (State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, People's Republic of China)
Qimin Cheng (School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, People's Republic of China)
Xiao Huang (Department of Geosciences, University of Arkansas, Fayetteville, USA)
Xiaoping Wu (School of Geography and Resources Science, Sichuan Normal University, Sichuan, People's Republic of China)
Guoming Li (School of Resources and Environment, University of Electronic Science and Technology, Sichuan, People's Republic of China)
Li Tan (School of Geophysics, Chengdu University of Technology, Sichuan, People's Republic of China)
https://www.tandfonline.com/doi/10.1080/17538947.2023.2283482
Image captioning
deep learning
semantic understanding
visual-text alignment
title MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images
topic Image captioning
deep learning
semantic understanding
visual-text alignment
url https://www.tandfonline.com/doi/10.1080/17538947.2023.2283482