An Attentive Fourier-Augmented Image-Captioning Transformer

Many vision–language models that output natural language, such as image-captioning models, use image features merely to ground the captions; most of the model's performance can be attributed to the language model, which does all the heavy lifting, a phenomenon that has persi...

Bibliographic Details
Main Authors: Raymond Ian Osolo, Zhan Yang, Jun Long
Format: Article
Language: English
Published: MDPI AG 2021-09-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/11/18/8354