MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images

Abstract: Remote Sensing Image Captioning (RSIC) plays a crucial role in advancing semantic understanding and has increasingly become a focal point of research. Nevertheless, existing RSIC methods grapple with challenges due to the intricate multi-scale nature and multifaceted backgrounds inherent in...


Bibliographic Details
Main Authors:	Haiyan Huang, Zhenfeng Shao, Qimin Cheng, Xiao Huang, Xiaoping Wu, Guoming Li, Li Tan
Format:	Article
Language:	English
Published:	Taylor & Francis Group 2023-12-01
Series:	International Journal of Digital Earth
Subjects:	Image captioning; deep learning; semantic understanding; visual-text alignment
Online Access:	https://www.tandfonline.com/doi/10.1080/17538947.2023.2283482

author	Haiyan Huang, Zhenfeng Shao, Qimin Cheng, Xiao Huang, Xiaoping Wu, Guoming Li, Li Tan
collection	DOAJ
description	Abstract: Remote Sensing Image Captioning (RSIC) plays a crucial role in advancing semantic understanding and has increasingly become a focal point of research. Nevertheless, existing RSIC methods grapple with challenges due to the intricate multi-scale nature and multifaceted backgrounds inherent in Remote Sensing Images (RSIs). Compounding these challenges are the perceptible information disparities across diverse modalities. In response to these challenges, we propose a novel multi-scale contextual information aggregation image captioning network (MC-Net). This network incorporates an image encoder enhanced with a multi-scale feature extraction module, a feature fusion module, and a finely tuned adaptive decoder equipped with a visual-text alignment module. Notably, MC-Net possesses the capability to extract informative multiscale features, facilitated by the multilayer perceptron and transformer. We also introduce an adaptive gating mechanism during the decoding phase to ensure precise alignment between visual regions and their corresponding text descriptions. Empirical studies conducted on four publicly recognized cross-modal datasets unequivocally demonstrate the superior robustness and efficacy of MC-Net in comparison to contemporaneous RSIC methods.
first_indexed	2024-03-09T14:23:08Z
format	Article
id	doaj.art-d144ee6e2d634972896f943d9224b377
institution	Directory Open Access Journal
issn	1753-8947 1753-8955
language	English
last_indexed	2024-03-09T14:23:08Z
publishDate	2023-12-01
publisher	Taylor & Francis Group
record_format	Article
series	International Journal of Digital Earth
spelling	doaj.art-d144ee6e2d634972896f943d9224b377; 2023-11-28T09:04:27Z; eng; Taylor & Francis Group; International Journal of Digital Earth; ISSN 1753-8947, 1753-8955; 2023-12-01; vol. 16, no. 2, pp. 4848-4866; doi:10.1080/17538947.2023.2283482
affiliations	Haiyan Huang: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, People's Republic of China
	Zhenfeng Shao: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, People's Republic of China
	Qimin Cheng: School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, People's Republic of China
	Xiao Huang: Department of Geosciences, University of Arkansas, Fayetteville, USA
	Xiaoping Wu: School of Geography and Resources Science, Sichuan Normal University, Sichuan, People's Republic of China
	Guoming Li: School of Resources and Environment, University of Electronic Science and Technology, Sichuan, People's Republic of China
	Li Tan: School of Geophysics, Chengdu University of Technology, Sichuan, People's Republic of China
title	MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images
topic	Image captioning; deep learning; semantic understanding; visual-text alignment
url	https://www.tandfonline.com/doi/10.1080/17538947.2023.2283482