From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning

With the growing volume of remote sensing images, automatic understanding of image content has attracted considerable interest among deep learning researchers in remote sensing. Inspired by natural image captioning, the model with convolutional neural network (CNN)-recurrent neural network (...

Bibliographic Details
Main Authors: Runyan Du, Wei Cao, Wenkai Zhang, Guo Zhi, Xian Sun, Shuoke Li, Jihao Li
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Subjects: Attention; remote sensing image captioning (RSIC); transformer
Online Access: https://ieeexplore.ieee.org/document/10221707/
_version_ 1797330392782471168
author Runyan Du
Wei Cao
Wenkai Zhang
Guo Zhi
Xian Sun
Shuoke Li
Jihao Li
collection DOAJ
description With the growing volume of remote sensing images, automatic understanding of image content has attracted considerable interest among deep learning researchers in remote sensing. Inspired by natural image captioning, models that use a convolutional neural network (CNN)-recurrent neural network (RNN) backbone supplemented by attention have been widely used in remote sensing image captioning. However, current attention layers are inefficient at simultaneously mining the hidden foreground from the background of a remote sensing image and performing interactive feature learning. Meanwhile, newer mainstream language models have recently surpassed the traditional long short-term memory (LSTM) network in sentence generation. To address these problems, this article proposes making flat remote sensing images stereoscopic by separating the foreground from the background. Based on this hierarchical image information, we design a novel Deformable Transformer equipped with deformable scaled dot-product attention, which learns multiscale features from the foreground and background through its strong interactive learning ability. Evaluations are conducted on four classic remote sensing image captioning datasets. Compared with state-of-the-art methods, our Transformer variant achieves higher captioning accuracy.
first_indexed 2024-03-08T07:19:12Z
format Article
id doaj.art-36bdc25d7c4a4495b0ba97d0d0751c5a
institution Directory Open Access Journal
issn 2151-1535
language English
last_indexed 2024-03-08T07:19:12Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
spelling doaj.art-36bdc25d7c4a4495b0ba97d0d0751c5a
2024-02-03T00:01:21Z
eng
IEEE
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
2151-1535
2023-01-01
16
7704
7717
10.1109/JSTARS.2023.3305889
10221707
From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning
Runyan Du https://orcid.org/0000-0002-8207-4462
Wei Cao https://orcid.org/0009-0002-4291-4970
Wenkai Zhang https://orcid.org/0000-0002-8903-2708
Guo Zhi https://orcid.org/0000-0001-5083-3578
Xian Sun https://orcid.org/0000-0002-0038-9816
Shuoke Li https://orcid.org/0009-0003-6071-8014
Jihao Li https://orcid.org/0000-0002-8277-4223
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China (listed for each of the seven authors)
https://ieeexplore.ieee.org/document/10221707/
Attention
remote sensing image captioning (RSIC)
transformer
title From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning
topic Attention
remote sensing image captioning (RSIC)
transformer
url https://ieeexplore.ieee.org/document/10221707/