Hierarchical Attention-Based Fusion for Image Caption With Multi-Grained Rewards

Image captioning based on reinforcement learning (RL) has achieved significant success recently. Most of these methods take the CIDEr score as the reward for the reinforcement learning algorithm when computing gradients, thereby refining the baseline image captioning model. However, the CIDEr score is not the sole cr...


Bibliographic Details
Main Authors:	Chunlei Wu, Shaozu Yuan, Haiwen Cao, Yiwei Wei, Leiquan Wang
Format:	Article
Language:	English
Published:	IEEE 2020-01-01
Series:	IEEE Access
Subjects:	Image caption; reinforcement learning; attention mechanism
Online Access:	https://ieeexplore.ieee.org/document/9039552/

collection	DOAJ
description	Image captioning based on reinforcement learning (RL) has achieved significant success recently. Most of these methods take the CIDEr score as the reward for the reinforcement learning algorithm when computing gradients, thereby refining the baseline image captioning model. However, the CIDEr score is not the sole criterion for judging the quality of a generated caption. In this paper, a Hierarchical Attention Fusion (HAF) model is presented as a baseline for RL-based image captioning, in which multi-level feature maps of ResNet are integrated with hierarchical attention. A Revaluation Network (REN) is used to revaluate the CIDEr score by assigning a different weight to each word according to its importance in the generated caption. The weighted reward can be regarded as a word-level reward. Moreover, a Scoring Network (SN) is implemented to score the generated sentence with its corresponding ground truth from a batch of captions. This reward benefits from the additional unmatched ground truth and acts as a sentence-level reward. Experimental results on the COCO dataset show that the proposed methods achieve competitive performance compared with related image captioning methods.
first_indexed	2024-12-20T04:42:27Z
format	Article
id	doaj.art-5ace8122ab4344ce871ceafff5a825c7
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-12-20T04:42:27Z
publishDate	2020-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-5ace8122ab4344ce871ceafff5a825c7; 2022-12-21T19:53:05Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2020-01-01; vol. 8, pp. 57943-57951; DOI 10.1109/ACCESS.2020.2981513; 9039552
	Hierarchical Attention-Based Fusion for Image Caption With Multi-Grained Rewards
	Chunlei Wu (https://orcid.org/0000-0002-0944-2564), College of Computer Science and Technology, China University of Petroleum, Qingdao, China
	Shaozu Yuan (https://orcid.org/0000-0001-5084-7064), College of Computer Science and Technology, China University of Petroleum, Qingdao, China
	Haiwen Cao (https://orcid.org/0000-0002-2863-5687), College of Computer Science and Technology, China University of Petroleum, Qingdao, China
	Yiwei Wei (https://orcid.org/0000-0002-7627-5487), School of Petroleum Engineering, China University of Petroleum-Beijing at Karamay, Karamay, China
	Leiquan Wang (https://orcid.org/0000-0003-4314-0030), College of Computer Science and Technology, China University of Petroleum, Qingdao, China
	https://ieeexplore.ieee.org/document/9039552/
	Image caption; reinforcement learning; attention mechanism
title	Hierarchical Attention-Based Fusion for Image Caption With Multi-Grained Rewards
topic	Image caption; reinforcement learning; attention mechanism
url	https://ieeexplore.ieee.org/document/9039552/
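
The word-level reward mentioned in the description (a sentence-level CIDEr score redistributed over words by importance) can be illustrated with a minimal sketch. This is not the authors' REN: the function names, the softmax normalisation, and the rescaling so that weights average to 1 are all assumptions made for illustration, and the importance scores are taken as given rather than predicted by a network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_level_rewards(sentence_reward, importance_scores):
    """Spread one sentence-level reward (e.g. a CIDEr score) over words.

    Assumption: importance scores are turned into weights with a softmax
    and rescaled so the weights average to 1. With equal scores every
    word then receives exactly the sentence-level reward.
    """
    weights = softmax(importance_scores)
    n = len(weights)
    return [sentence_reward * w * n for w in weights]


def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style loss: negative sum of reward-weighted log-probabilities."""
    return -sum(lp * r for lp, r in zip(log_probs, rewards))


if __name__ == "__main__":
    # Hypothetical values: a 4-word sampled caption with a sentence reward of 1.0.
    scores = [2.0, 0.5, -1.0, 0.0]      # hypothetical per-word importance
    log_probs = [-0.3, -1.2, -0.8, -0.5]  # hypothetical log-probabilities
    rewards = word_level_rewards(1.0, scores)
    print(rewards)
    print(policy_gradient_loss(log_probs, rewards))
```

With uniform importance scores this reduces to the usual setting in which every word shares the same sentence-level reward, so the weighting only changes how the gradient signal is distributed across words.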


Similar Items