Controllable Image Captioning with Feature Refinement and Multilayer Fusion
Image captioning is the task of automatically generating a description of an image. Traditional image captioning models tend to generate a sentence describing the most conspicuous objects, but fail to describe a desired region or object as a human would. In order to generate sentences based on a given targe...
Main Authors: | Sen Du, Hong Zhu, Yujia Zhang, Dong Wang, Jing Shi, Nan Xing, Guangfeng Lin, Huiyu Zhou |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-04-01 |
Series: | Applied Sciences |
Subjects: | controllable image captioning; information-augmented embedding; MR-WGCN; similarity loss |
Online Access: | https://www.mdpi.com/2076-3417/13/8/5020 |
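The abstract in this record describes a Multi-Relational Weighted Graph Convolutional Network (MR-WGCN) that fuses the features of adjacent objects across weighted relation types. The paper's equations are not included in the record, so the sketch below is only a plausible reading of that idea: the function name `mr_wgcn_layer`, the single `"spatial"` relation, and the toy adjacency matrix are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mr_wgcn_layer(H, adjacency_by_relation, weights_by_relation, W_self):
    """One plausible multi-relational weighted graph convolution step.

    H: (N, d) object features. adjacency_by_relation maps each relation
    type to a weighted (N, N) adjacency matrix; each relation gets its
    own projection, and all messages are summed before a ReLU.
    """
    out = H @ W_self  # self-loop transform of each object's own features
    for rel, A in adjacency_by_relation.items():
        # Normalize edge weights per node so message magnitude is bounded
        deg = A.sum(axis=1, keepdims=True)
        A_norm = A / np.maximum(deg, 1e-8)
        # Aggregate neighbor features under this relation's projection
        out += A_norm @ H @ weights_by_relation[rel]
    return np.maximum(out, 0.0)  # ReLU

# Toy usage: 3 objects, 4-dim features, one hypothetical "spatial" relation
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
params = {"spatial": rng.normal(size=(4, 4))}
H_next = mr_wgcn_layer(H, {"spatial": A}, params, rng.normal(size=(4, 4)))
print(H_next.shape)  # shape is preserved: (3, 4)
```

Per-node weight normalization is one common design choice for weighted GCNs; the paper may use a different normalization or gating scheme.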
_version_ | 1797606491914502144 |
---|---|
author | Sen Du, Hong Zhu, Yujia Zhang, Dong Wang, Jing Shi, Nan Xing, Guangfeng Lin, Huiyu Zhou |
author_facet | Sen Du, Hong Zhu, Yujia Zhang, Dong Wang, Jing Shi, Nan Xing, Guangfeng Lin, Huiyu Zhou |
author_sort | Sen Du |
collection | DOAJ |
description | Image captioning is the task of automatically generating a description of an image. Traditional image captioning models tend to generate a sentence describing the most conspicuous objects, but fail to describe a desired region or object as a human would. In order to generate sentences based on a given target, understanding the relationships between particular objects and describing them accurately is central to this task. To this end, the proposed model, IANR, adds prior information to each object through information-augmented embedding, and a new Multi-Relational Weighted Graph Convolutional Network (MR-WGCN) is designed to fuse the information of adjacent objects. Then, a dynamic attention decoder module selectively focuses on particular objects or semantic contents. Finally, the model is optimized by a similarity loss. Experiments on MSCOCO Entities demonstrate that IANR obtains, to date, the best published CIDEr score of 124.52% on the Karpathy test split. Extensive experiments and ablations on both MSCOCO Entities and Flickr30k Entities demonstrate the effectiveness of each module. Meanwhile, IANR achieves better accuracy and controllability than state-of-the-art models under widely used evaluation metrics. |
first_indexed | 2024-03-11T05:15:57Z |
format | Article |
id | doaj.art-5f3b030084bd4a23a45a90d5634c2466 |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-11T05:15:57Z |
publishDate | 2023-04-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-5f3b030084bd4a23a45a90d5634c2466 2023-11-17T18:12:32Z eng MDPI AG Applied Sciences 2076-3417 2023-04-01 vol. 13, no. 8, art. 5020, doi:10.3390/app13085020. Controllable Image Captioning with Feature Refinement and Multilayer Fusion. Sen Du, Hong Zhu, Yujia Zhang, Dong Wang, Jing Shi, Nan Xing (School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China); Guangfeng Lin (School of Printing, Packaging and Digital Media, Xi’an University of Technology, Xi’an 710054, China); Huiyu Zhou (School of Computing and Mathematical Sciences, University of Leicester, University Road, Leicester LE1 7RH, UK). Image captioning is the task of automatically generating a description of an image. Traditional image captioning models tend to generate a sentence describing the most conspicuous objects, but fail to describe a desired region or object as a human would. In order to generate sentences based on a given target, understanding the relationships between particular objects and describing them accurately is central to this task. To this end, the proposed model, IANR, adds prior information to each object through information-augmented embedding, and a new Multi-Relational Weighted Graph Convolutional Network (MR-WGCN) is designed to fuse the information of adjacent objects. Then, a dynamic attention decoder module selectively focuses on particular objects or semantic contents. Finally, the model is optimized by a similarity loss. Experiments on MSCOCO Entities demonstrate that IANR obtains, to date, the best published CIDEr score of 124.52% on the Karpathy test split. Extensive experiments and ablations on both MSCOCO Entities and Flickr30k Entities demonstrate the effectiveness of each module. Meanwhile, IANR achieves better accuracy and controllability than state-of-the-art models under widely used evaluation metrics. https://www.mdpi.com/2076-3417/13/8/5020. Keywords: controllable image captioning; information-augmented embedding; MR-WGCN; similarity loss |
spellingShingle | Sen Du; Hong Zhu; Yujia Zhang; Dong Wang; Jing Shi; Nan Xing; Guangfeng Lin; Huiyu Zhou; Controllable Image Captioning with Feature Refinement and Multilayer Fusion; Applied Sciences; controllable image captioning; information-augmented embedding; MR-WGCN; similarity loss |
title | Controllable Image Captioning with Feature Refinement and Multilayer Fusion |
title_full | Controllable Image Captioning with Feature Refinement and Multilayer Fusion |
title_fullStr | Controllable Image Captioning with Feature Refinement and Multilayer Fusion |
title_full_unstemmed | Controllable Image Captioning with Feature Refinement and Multilayer Fusion |
title_short | Controllable Image Captioning with Feature Refinement and Multilayer Fusion |
title_sort | controllable image captioning with feature refinement and multilayer fusion |
topic | controllable image captioning; information-augmented embedding; MR-WGCN; similarity loss |
url | https://www.mdpi.com/2076-3417/13/8/5020 |
work_keys_str_mv | AT sendu controllableimagecaptioningwithfeaturerefinementandmultilayerfusion AT hongzhu controllableimagecaptioningwithfeaturerefinementandmultilayerfusion AT yujiazhang controllableimagecaptioningwithfeaturerefinementandmultilayerfusion AT dongwang controllableimagecaptioningwithfeaturerefinementandmultilayerfusion AT jingshi controllableimagecaptioningwithfeaturerefinementandmultilayerfusion AT nanxing controllableimagecaptioningwithfeaturerefinementandmultilayerfusion AT guangfenglin controllableimagecaptioningwithfeaturerefinementandmultilayerfusion AT huiyuzhou controllableimagecaptioningwithfeaturerefinementandmultilayerfusion |
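The abstract also states that the model "is optimized by similarity loss" without giving a formulation. A minimal cosine-based version is one common reading; the sketch below is an assumption for illustration, not necessarily the authors' loss.

```python
import numpy as np

def similarity_loss(pred, target, eps=1e-8):
    """Assumed similarity loss: 1 minus cosine similarity between a
    predicted embedding and a target embedding, averaged over the batch.
    The paper's exact formulation is not given in this record."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    cos = (pred_n * target_n).sum(axis=-1)  # per-sample cosine similarity
    return float((1.0 - cos).mean())

# Identical embeddings give a (near-)zero loss
x = np.array([[1.0, 2.0, 3.0],
              [0.5, 0.0, 1.0]])
loss = similarity_loss(x, x)
print(loss)
```

A loss of this shape pushes predicted embeddings toward the direction of their targets while ignoring magnitude, which is why the vectors are L2-normalized before the dot product.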