From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning
With the growth of remote sensing imagery, automatically understanding image content has attracted considerable research interest in deep learning for remote sensing. Inspired by natural image captioning, models with a convolutional neural network (CNN)-recurrent neural network (...
Main Authors: | Runyan Du, Wei Cao, Wenkai Zhang, Guo Zhi, Xian Sun, Shuoke Li, Jihao Li |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2023-01-01 |
Series: | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
Subjects: | Attention; remote sensing image captioning (RSIC); transformer |
Online Access: | https://ieeexplore.ieee.org/document/10221707/ |
_version_ | 1797330392782471168 |
---|---|
author | Runyan Du, Wei Cao, Wenkai Zhang, Guo Zhi, Xian Sun, Shuoke Li, Jihao Li |
author_facet | Runyan Du, Wei Cao, Wenkai Zhang, Guo Zhi, Xian Sun, Shuoke Li, Jihao Li |
author_sort | Runyan Du |
collection | DOAJ |
description | With the growth of remote sensing imagery, automatically understanding image content has attracted considerable research interest in deep learning for remote sensing. Inspired by natural image captioning, models with a convolutional neural network (CNN)-recurrent neural network (RNN) backbone supplemented by attention have been widely used in remote sensing image captioning. However, it is inefficient for the current attention layer to simultaneously mine hidden foreground from the background of a remote sensing image and perform interactive feature learning. Meanwhile, the new mainstream language model has recently surpassed the traditional long short-term memory (LSTM) in sentence generation. To address these problems, in this article we propose a novel idea: making flat remote sensing images stereoscopic by separating the foreground from the background. Based on this hierarchical image information, we design a novel Deformable Transformer equipped with deformable scaled dot-product attention to learn multiscale features from the foreground and background through its powerful interactive learning ability. Evaluations are conducted on four classic remote sensing image captioning datasets. Compared with state-of-the-art methods, our Transformer variant achieves higher captioning accuracy. |
first_indexed | 2024-03-08T07:19:12Z |
format | Article |
id | doaj.art-36bdc25d7c4a4495b0ba97d0d0751c5a |
institution | Directory Open Access Journal |
issn | 2151-1535 |
language | English |
last_indexed | 2024-03-08T07:19:12Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
spelling | doaj.art-36bdc25d7c4a4495b0ba97d0d0751c5a; 2024-02-03T00:01:21Z; eng; IEEE; IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing; 2151-1535; 2023-01-01; 16; 7704; 7717; 10.1109/JSTARS.2023.3305889; 10221707; From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning; Runyan Du (0) https://orcid.org/0000-0002-8207-4462; Wei Cao (1) https://orcid.org/0009-0002-4291-4970; Wenkai Zhang (2) https://orcid.org/0000-0002-8903-2708; Guo Zhi (3) https://orcid.org/0000-0001-5083-3578; Xian Sun (4) https://orcid.org/0000-0002-0038-9816; Shuoke Li (5) https://orcid.org/0009-0003-6071-8014; Jihao Li (6) https://orcid.org/0000-0002-8277-4223; (0-6) Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China; With the growth of remote sensing imagery, automatically understanding image content has attracted considerable research interest in deep learning for remote sensing. Inspired by natural image captioning, models with a convolutional neural network (CNN)-recurrent neural network (RNN) backbone supplemented by attention have been widely used in remote sensing image captioning. However, it is inefficient for the current attention layer to simultaneously mine hidden foreground from the background of a remote sensing image and perform interactive feature learning. Meanwhile, the new mainstream language model has recently surpassed the traditional long short-term memory (LSTM) in sentence generation. To address these problems, in this article we propose a novel idea: making flat remote sensing images stereoscopic by separating the foreground from the background. Based on this hierarchical image information, we design a novel Deformable Transformer equipped with deformable scaled dot-product attention to learn multiscale features from the foreground and background through its powerful interactive learning ability. Evaluations are conducted on four classic remote sensing image captioning datasets. Compared with state-of-the-art methods, our Transformer variant achieves higher captioning accuracy.; https://ieeexplore.ieee.org/document/10221707/; Attention; remote sensing image captioning (RSIC); transformer |
spellingShingle | Runyan Du Wei Cao Wenkai Zhang Guo Zhi Xian Sun Shuoke Li Jihao Li From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing Attention remote sensing image captioning (RSIC) transformer |
title | From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning |
title_full | From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning |
title_fullStr | From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning |
title_full_unstemmed | From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning |
title_short | From Plane to Hierarchy: Deformable Transformer for Remote Sensing Image Captioning |
title_sort | from plane to hierarchy deformable transformer for remote sensing image captioning |
topic | Attention; remote sensing image captioning (RSIC); transformer |
url | https://ieeexplore.ieee.org/document/10221707/ |
work_keys_str_mv | AT runyandu fromplanetohierarchydeformabletransformerforremotesensingimagecaptioning AT weicao fromplanetohierarchydeformabletransformerforremotesensingimagecaptioning AT wenkaizhang fromplanetohierarchydeformabletransformerforremotesensingimagecaptioning AT guozhi fromplanetohierarchydeformabletransformerforremotesensingimagecaptioning AT xiansun fromplanetohierarchydeformabletransformerforremotesensingimagecaptioning AT shuokeli fromplanetohierarchydeformabletransformerforremotesensingimagecaptioning AT jihaoli fromplanetohierarchydeformabletransformerforremotesensingimagecaptioning |