Full-Memory Transformer for Image Captioning
The Transformer-based approach represents the state of the art in image captioning. However, existing studies have shown that the Transformer suffers from a problem in which irrelevant tokens with overlapping neighbors incorrectly attend to each other with relatively large attention scores. We believe that this limitation stems from the incompleteness of the Self-Attention Network (SAN) and the Feed-Forward Network (FFN). To solve this problem, we present the Full-Memory Transformer for image captioning, which improves both image encoding and language decoding. In the image encoding step, we propose the Full-LN symmetric structure, which enables stable training and better generalization by symmetrically embedding Layer Normalization on both sides of the SAN and the FFN. In the language decoding step, we propose the Memory Attention Network (MAN), which extends the traditional attention mechanism to measure the correlation between the attention result and the input sequence, guiding the model to focus on the words that need attention. Our method is evaluated on the MS COCO dataset and achieves good performance, improving BLEU-4 from 38.4 to 39.3.
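The abstract does not spell out the Full-LN structure beyond placing Layer Normalization "on both sides" of the SAN and FFN. Below is a minimal PyTorch sketch of one plausible reading; the module names, dimensions, and residual placement are illustrative assumptions, not the authors' published code.

```python
import torch
import torch.nn as nn


class FullLNEncoderLayer(nn.Module):
    """Hypothetical encoder layer with a Full-LN symmetric structure:
    LayerNorm embedded on both sides of the self-attention (SAN) and
    feed-forward (FFN) sub-layers. Names and dims are assumptions."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.san = nn.MultiheadAttention(d_model, n_heads,
                                         dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        # One LayerNorm before and one after each sub-layer ("both sides").
        self.ln_pre_san = nn.LayerNorm(d_model)
        self.ln_post_san = nn.LayerNorm(d_model)
        self.ln_pre_ffn = nn.LayerNorm(d_model)
        self.ln_post_ffn = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Pre-LN -> SAN -> Post-LN, wrapped in a residual connection.
        h = self.ln_pre_san(x)
        h, _ = self.san(h, h, h, need_weights=False)
        x = x + self.drop(self.ln_post_san(h))
        # The same symmetric pattern around the FFN.
        h = self.ln_pre_ffn(x)
        x = x + self.drop(self.ln_post_ffn(self.ffn(h)))
        return x


# Usage on a batch of image region features (B=2, N=49 regions, d=512).
layer = FullLNEncoderLayer()
out = layer(torch.randn(2, 49, 512))
```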
Main Authors: | Tongwei Lu, Jiarong Wang, Fen Min |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-01-01 |
Series: | Symmetry |
Subjects: | transformer; attention; image captioning; symmetric |
Online Access: | https://www.mdpi.com/2073-8994/15/1/190 |
_version_ | 1797436856582799360 |
---|---|
author | Tongwei Lu; Jiarong Wang; Fen Min |
author_facet | Tongwei Lu; Jiarong Wang; Fen Min |
author_sort | Tongwei Lu |
collection | DOAJ |
description | The Transformer-based approach represents the state of the art in image captioning. However, existing studies have shown that the Transformer suffers from a problem in which irrelevant tokens with overlapping neighbors incorrectly attend to each other with relatively large attention scores. We believe that this limitation stems from the incompleteness of the Self-Attention Network (SAN) and the Feed-Forward Network (FFN). To solve this problem, we present the Full-Memory Transformer for image captioning, which improves both image encoding and language decoding. In the image encoding step, we propose the Full-LN symmetric structure, which enables stable training and better generalization by symmetrically embedding Layer Normalization on both sides of the SAN and the FFN. In the language decoding step, we propose the Memory Attention Network (MAN), which extends the traditional attention mechanism to measure the correlation between the attention result and the input sequence, guiding the model to focus on the words that need attention. Our method is evaluated on the MS COCO dataset and achieves good performance, improving BLEU-4 from 38.4 to 39.3. |
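Similarly, the Memory Attention Network is only characterized here as extending attention to measure the correlation between the attention result and the input sequence. Below is a hedged sketch of one way such a correlation gate could work; the gating formulation and every identifier in it are assumptions for illustration, not the paper's method.

```python
import torch
import torch.nn as nn


class MemoryAttentionNetwork(nn.Module):
    """Hypothetical sketch of a Memory Attention Network: standard
    cross-attention followed by a gate that scores how well the
    attention result correlates with the input (word) sequence.
    The gating formulation is an assumption, not the paper's code."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate computed from [input ; attention result] -- an assumed design.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, words, memory):
        # words: (B, T, d) decoder word features; memory: (B, N, d) image features.
        attended, _ = self.attn(words, memory, memory, need_weights=False)
        # A correlation gate between the attention result and the input
        # sequence decides how much each word relies on attended content.
        g = self.gate(torch.cat([words, attended], dim=-1))
        return g * attended + (1.0 - g) * words


# Usage: 2 captions of 12 words attending over 49 image regions.
man = MemoryAttentionNetwork()
out = man(torch.randn(2, 12, 512), torch.randn(2, 49, 512))
```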
first_indexed | 2024-03-09T11:08:38Z |
format | Article |
id | doaj.art-ee013be9068c439da3587dbec6fc7182 |
institution | Directory Open Access Journal |
issn | 2073-8994 |
language | English |
last_indexed | 2024-03-09T11:08:38Z |
publishDate | 2023-01-01 |
publisher | MDPI AG |
record_format | Article |
series | Symmetry |
spelling | doaj.art-ee013be9068c439da3587dbec6fc7182 (2023-12-01T00:53:14Z); eng; MDPI AG; Symmetry (ISSN 2073-8994); 2023-01-01; vol. 15, no. 1, p. 190; doi:10.3390/sym15010190. Full-Memory Transformer for Image Captioning. Tongwei Lu, Jiarong Wang, Fen Min (School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China). [Abstract identical to the description field above.] https://www.mdpi.com/2073-8994/15/1/190. Keywords: transformer; attention; image captioning; symmetric. |
spellingShingle | Tongwei Lu; Jiarong Wang; Fen Min; Full-Memory Transformer for Image Captioning; Symmetry; transformer; attention; image captioning; symmetric |
title | Full-Memory Transformer for Image Captioning |
title_full | Full-Memory Transformer for Image Captioning |
title_fullStr | Full-Memory Transformer for Image Captioning |
title_full_unstemmed | Full-Memory Transformer for Image Captioning |
title_short | Full-Memory Transformer for Image Captioning |
title_sort | full memory transformer for image captioning |
topic | transformer; attention; image captioning; symmetric |
url | https://www.mdpi.com/2073-8994/15/1/190 |
work_keys_str_mv | AT tongweilu fullmemorytransformerforimagecaptioning AT jiarongwang fullmemorytransformerforimagecaptioning AT fenmin fullmemorytransformerforimagecaptioning |