MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images
ABSTRACT: Remote Sensing Image Captioning (RSIC) plays a crucial role in advancing semantic understanding and has increasingly become a focal point of research. Nevertheless, existing RSIC methods grapple with challenges due to the intricate multi-scale nature and multifaceted backgrounds inherent in...
Main Authors: | Haiyan Huang, Zhenfeng Shao, Qimin Cheng, Xiao Huang, Xiaoping Wu, Guoming Li, Li Tan |
---|---|
Format: | Article |
Language: | English |
Published: | Taylor & Francis Group, 2023-12-01 |
Series: | International Journal of Digital Earth |
Subjects: | Image captioning, deep learning, semantic understanding, visual-text alignment |
Online Access: | https://www.tandfonline.com/doi/10.1080/17538947.2023.2283482 |
_version_ | 1797449313935163392 |
---|---|
author | Haiyan Huang, Zhenfeng Shao, Qimin Cheng, Xiao Huang, Xiaoping Wu, Guoming Li, Li Tan |
author_facet | Haiyan Huang, Zhenfeng Shao, Qimin Cheng, Xiao Huang, Xiaoping Wu, Guoming Li, Li Tan |
author_sort | Haiyan Huang |
collection | DOAJ |
description | ABSTRACT: Remote Sensing Image Captioning (RSIC) plays a crucial role in advancing semantic understanding and has increasingly become a focal point of research. Nevertheless, existing RSIC methods grapple with challenges due to the intricate multi-scale nature and multifaceted backgrounds inherent in Remote Sensing Images (RSIs). Compounding these challenges are the perceptible information disparities across diverse modalities. In response to these challenges, we propose a novel multi-scale contextual information aggregation image captioning network (MC-Net). This network incorporates an image encoder enhanced with a multi-scale feature extraction module, a feature fusion module, and a finely tuned adaptive decoder equipped with a visual-text alignment module. Notably, MC-Net possesses the capability to extract informative multi-scale features, facilitated by the multilayer perceptron and transformer. We also introduce an adaptive gating mechanism during the decoding phase to ensure precise alignment between visual regions and their corresponding text descriptions. Empirical studies conducted on four publicly recognized cross-modal datasets unequivocally demonstrate the superior robustness and efficacy of MC-Net in comparison to contemporaneous RSIC methods. |
first_indexed | 2024-03-09T14:23:08Z |
format | Article |
id | doaj.art-d144ee6e2d634972896f943d9224b377 |
institution | Directory Open Access Journal |
issn | 1753-8947 1753-8955 |
language | English |
last_indexed | 2024-03-09T14:23:08Z |
publishDate | 2023-12-01 |
publisher | Taylor & Francis Group |
record_format | Article |
series | International Journal of Digital Earth |
spelling | doaj.art-d144ee6e2d634972896f943d9224b377; 2023-11-28T09:04:27Z; eng; Taylor & Francis Group; International Journal of Digital Earth; 1753-8947; 1753-8955; 2023-12-01; vol. 16, no. 2, pp. 4848-4866; 10.1080/17538947.2023.2283482; MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images; Haiyan Huang (State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, People's Republic of China); Zhenfeng Shao (State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, People's Republic of China); Qimin Cheng (School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, People's Republic of China); Xiao Huang (Department of Geosciences, University of Arkansas, Fayetteville, USA); Xiaoping Wu (School of Geography and Resources Science, Sichuan Normal University, Sichuan, People's Republic of China); Guoming Li (School of Resources and Environment, University of Electronic Science and Technology, Sichuan, People's Republic of China); Li Tan (School of Geophysics, Chengdu University of Technology, Sichuan, People's Republic of China); ABSTRACT: Remote Sensing Image Captioning (RSIC) plays a crucial role in advancing semantic understanding and has increasingly become a focal point of research. Nevertheless, existing RSIC methods grapple with challenges due to the intricate multi-scale nature and multifaceted backgrounds inherent in Remote Sensing Images (RSIs). Compounding these challenges are the perceptible information disparities across diverse modalities. In response to these challenges, we propose a novel multi-scale contextual information aggregation image captioning network (MC-Net). This network incorporates an image encoder enhanced with a multi-scale feature extraction module, a feature fusion module, and a finely tuned adaptive decoder equipped with a visual-text alignment module. Notably, MC-Net possesses the capability to extract informative multi-scale features, facilitated by the multilayer perceptron and transformer. We also introduce an adaptive gating mechanism during the decoding phase to ensure precise alignment between visual regions and their corresponding text descriptions. Empirical studies conducted on four publicly recognized cross-modal datasets unequivocally demonstrate the superior robustness and efficacy of MC-Net in comparison to contemporaneous RSIC methods.; https://www.tandfonline.com/doi/10.1080/17538947.2023.2283482; Image captioning; deep learning; semantic understanding; visual-text alignment |
spellingShingle | Haiyan Huang; Zhenfeng Shao; Qimin Cheng; Xiao Huang; Xiaoping Wu; Guoming Li; Li Tan; MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images; International Journal of Digital Earth; Image captioning; deep learning; semantic understanding; visual-text alignment |
title | MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images |
title_full | MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images |
title_fullStr | MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images |
title_full_unstemmed | MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images |
title_short | MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images |
title_sort | mc net multi scale contextual information aggregation network for image captioning on remote sensing images |
topic | Image captioning, deep learning, semantic understanding, visual-text alignment |
url | https://www.tandfonline.com/doi/10.1080/17538947.2023.2283482 |
work_keys_str_mv | AT haiyanhuang mcnetmultiscalecontextualinformationaggregationnetworkforimagecaptioningonremotesensingimages AT zhenfengshao mcnetmultiscalecontextualinformationaggregationnetworkforimagecaptioningonremotesensingimages AT qimincheng mcnetmultiscalecontextualinformationaggregationnetworkforimagecaptioningonremotesensingimages AT xiaohuang mcnetmultiscalecontextualinformationaggregationnetworkforimagecaptioningonremotesensingimages AT xiaopingwu mcnetmultiscalecontextualinformationaggregationnetworkforimagecaptioningonremotesensingimages AT guomingli mcnetmultiscalecontextualinformationaggregationnetworkforimagecaptioningonremotesensingimages AT litan mcnetmultiscalecontextualinformationaggregationnetworkforimagecaptioningonremotesensingimages |