Stack-VS : stacked visual-semantic attention for image caption generation

Recently, automatic image caption generation has been an important focus of work on multimodal translation tasks. Existing approaches can be roughly categorized into two classes, top-down and bottom-up: the former transfers the image information (called the visual-level feature) directly into a ca...


Bibliographic Details
Main Authors: Cheng, Ling, Wei, Wei, Mao, Xianling, Liu, Yong, Miao, Chunyan
Other Authors: School of Computer Science and Engineering
Format: Journal Article
Language: English
Published: 2021
Subjects: Engineering::Computer science and engineering, Image Captioning, Recurrent Neural Network
Online Access: https://hdl.handle.net/10356/148460
author Cheng, Ling
Wei, Wei
Mao, Xianling
Liu, Yong
Miao, Chunyan
collection NTU
description Recently, automatic image caption generation has been an important focus of work on multimodal translation tasks. Existing approaches can be roughly categorized into two classes, top-down and bottom-up: the former transfers the image information (called the visual-level feature) directly into a caption, and the latter uses extracted words (called semantic-level attributes) to generate a description. However, previous methods are typically either based on a one-stage decoder or utilize only part of the visual-level or semantic-level information for image caption generation. In this paper, we address this problem and propose an innovative multi-stage architecture (called Stack-VS) for rich, fine-grained image caption generation, which combines bottom-up and top-down attention models to effectively handle both the visual-level and semantic-level information of an input image. Specifically, we also propose a novel, well-designed stack decoder model constituted by a sequence of decoder cells, each of which contains two LSTM layers that work interactively to re-optimize the attention weights on both visual-level feature vectors and semantic-level attribute embeddings for generating a fine-grained image caption. Extensive experiments on the popular benchmark dataset MSCOCO show significant improvements on different evaluation metrics, i.e., the improvements on BLEU-4 / CIDEr / SPICE scores are 0.372, 1.226 and 0.216, respectively, as compared to the state-of-the-art.
first_indexed 2024-10-01T02:54:55Z
format Journal Article
id ntu-10356/148460
institution Nanyang Technological University
language English
last_indexed 2024-10-01T02:54:55Z
publishDate 2021
record_format dspace
spelling ntu-10356/148460 2021-04-27T02:54:08Z
Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY)
Published version
Cheng, L., Wei, W., Mao, X., Liu, Y. & Miao, C. (2020). Stack-VS : stacked visual-semantic attention for image caption generation. IEEE Access, 8, 154953-154965. https://dx.doi.org/10.1109/ACCESS.2020.3018752
ISSN 2169-3536
DOI 10.1109/ACCESS.2020.3018752
© 2020 IEEE. This journal is 100% open access, which means that all content is freely available without charge to users or their institutions. All articles accepted after 12 June 2019 are published under a CC BY 4.0 license, and the author retains copyright. Users are allowed to read, download, copy, distribute, print, search, or link to the full texts of the articles, or use them for any other lawful purpose, as long as proper attribution is given.
application/pdf
title Stack-VS : stacked visual-semantic attention for image caption generation
topic Engineering::Computer science and engineering
Image Captioning
Recurrent Neural Network
url https://hdl.handle.net/10356/148460