Image Caption Generation via Unified Retrieval and Generation-Based Method

Bibliographic Details
Main Authors: Shanshan Zhao, Lixiang Li, Haipeng Peng, Zihang Yang, Jiaxuan Zhang
Format: Article
Language: English
Published: MDPI AG 2020-09-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/10/18/6235
Description
Summary: Image captioning is a multi-modal transduction task that translates a source image into the target language. Most dominant approaches employ either a generation-based or a retrieval-based method, and each framework has its own advantages and disadvantages. In this work, we combine the strengths of both. We adopt the retrieval-based approach to search the MSCOCO data set for images visually similar to each queried image, together with their corresponding captions. Based on the retrieved similar sequences and the visual features of the queried image, the proposed de-noising module yields a set of attended textual features that bring additional textual information to the generation-based model. Finally, the decoder uses both the visual features and the textual features to generate the output descriptions. Additionally, the incorporated visual encoder and de-noising module can be applied as a preprocessing component for decoder-based attention mechanisms. We evaluate the proposed method on the MSCOCO benchmark data set. Extensive experiments yield state-of-the-art performance, and the incorporated module improves the baseline models on almost all the evaluation metrics.
ISSN: 2076-3417
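
Illustrative sketch: The summary describes two steps that a toy example may make more concrete: retrieving captions of visually similar images, then attending over the retrieved text using the query image's visual features. The code below is a minimal sketch under stated assumptions, not the authors' implementation. Cosine similarity for retrieval, plain dot-product attention standing in for the de-noising module, the `(feature, captions)` gallery layout, and all function names are hypothetical choices made for illustration; the record does not specify any of them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_captions(query_feat, gallery, k=2):
    """Return the captions of the k gallery images most similar to the query.

    gallery is a list of (image_feature, [captions]) pairs (assumed layout).
    """
    ranked = sorted(gallery, key=lambda item: cosine(query_feat, item[0]),
                    reverse=True)
    return [cap for _, caps in ranked[:k] for cap in caps]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query_feat, text_feats):
    """Weight text features by dot-product relevance to the visual query.

    A generic attention step used here as a stand-in for the de-noising
    module (assumption). Returns (attended_vector, weights).
    """
    scores = [sum(q * t for q, t in zip(query_feat, feat))
              for feat in text_feats]
    weights = softmax(scores)
    dim = len(text_feats[0])
    attended = [sum(w * feat[i] for w, feat in zip(weights, text_feats))
                for i in range(dim)]
    return attended, weights

# Toy data: 2-D "visual" features with made-up captions.
gallery = [
    ([1.0, 0.0], ["a dog runs"]),
    ([0.9, 0.1], ["a dog on grass"]),
    ([0.0, 1.0], ["a red car"]),
]
query = [1.0, 0.05]
captions = retrieve_captions(query, gallery, k=2)

# Toy text features sharing the visual feature dimension (assumption).
text_feats = [[1.0, 0.0], [0.0, 1.0]]
attended, weights = attend(query, text_feats)
```

In this toy setup the two dog images rank above the car image, and the attention weights favor the text feature aligned with the query. A decoder would then consume both `query` and `attended`, which is the combination of visual and textual features the summary describes.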