Video Captioning With Adaptive Attention and Mixed Loss Optimization

The attention mechanism and sequence-to-sequence framework have shown promising advancements in the temporal task of video captioning. However, imposing the attention mechanism on non-visual words, such as “of” and “the”, may mislead the decoder and decrease the overall performance of video captioning. Furthermore, the traditional sequence-to-sequence framework optimizes the model with a word-level cross-entropy loss, which results in an exposure bias problem: at test time the model predicts each word from its own previously generated words, whereas during training it maximizes the likelihood of the next ground-truth word conditioned on the true previous words. To address these issues, we propose the reinforced adaptive attention model (RAAM), which integrates an adaptive attention mechanism with a long short-term memory network to flexibly use visual signals and language information as needed. The model is trained with both a word-level loss and a sentence-level loss, combining the strengths of the two objectives and alleviating the exposure bias problem by directly optimizing the sentence-level metric with a reinforcement learning algorithm. In addition, a novel training method is proposed for mixed loss optimization. Experiments on the Microsoft Video Description corpus (MSVD) and the challenging MPII-MD Movie Description dataset demonstrate that RAAM, using only a single feature, achieves results competitive with or superior to existing state-of-the-art video captioning models.
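
As a concrete illustration of the adaptive attention idea named in the abstract, the sketch below implements a sentinel-style attention gate in PyTorch: a learned weight beta decides, at each decoding step, whether the next word should be driven by attended frame features or by the decoder's own language context. This is a minimal sketch of the general technique, not the paper's implementation; all module and variable names are hypothetical, and frame features are assumed pre-projected to the decoder's hidden size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    """Sentinel-gated attention over video frames (hypothetical sketch).

    A softmax is taken over the T frame slots plus one "sentinel" slot;
    the sentinel weight beta is the probability of *not* attending to
    the video, i.e. of generating a non-visual word from language
    context alone.
    """

    def __init__(self, hidden_dim: int, att_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(hidden_dim, att_dim)      # frame features
        self.hidden_proj = nn.Linear(hidden_dim, att_dim)    # decoder state
        self.sentinel_proj = nn.Linear(hidden_dim, att_dim)  # visual sentinel
        self.score = nn.Linear(att_dim, 1)

    def forward(self, frames, hidden, sentinel):
        # frames: (B, T, H) pre-projected frame features
        # hidden, sentinel: (B, H) decoder LSTM state and sentinel vector
        h = self.hidden_proj(hidden).unsqueeze(1)                    # (B, 1, A)
        z_v = self.score(torch.tanh(self.feat_proj(frames) + h))    # (B, T, 1)
        z_s = self.score(torch.tanh(
            self.sentinel_proj(sentinel).unsqueeze(1) + h))         # (B, 1, 1)
        alpha = F.softmax(torch.cat([z_v, z_s], dim=1), dim=1)      # (B, T+1, 1)
        visual_ctx = (alpha[:, :-1] * frames).sum(dim=1)            # (B, H)
        beta = alpha[:, -1]                                         # (B, 1)
        # beta -> 1: lean on language context (words like "of", "the");
        # beta -> 0: lean on the attended video content.
        context = beta * sentinel + (1.0 - beta) * visual_ctx
        return context, beta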

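Likewise, the mixed loss described in the abstract can be sketched as a weighted blend of word-level cross-entropy and a sentence-level, self-critical REINFORCE term that directly optimizes a caption metric such as CIDEr. The blending weight gamma, the pad index, and all tensor shapes below are assumptions for illustration; the paper's actual training schedule is not given in this record.

```python
import torch
import torch.nn.functional as F

def mixed_loss(word_log_probs, targets, sample_log_probs,
               sample_reward, greedy_reward, gamma=0.7, pad_id=0):
    """Blend word-level cross-entropy with a sentence-level RL term.

    Hypothetical sketch; gamma, pad_id, and all shapes are assumptions.
      word_log_probs:   (B, L, V) log-softmax outputs under teacher forcing
      targets:          (B, L)    ground-truth word ids
      sample_log_probs: (B, L')   log-probs of each sampled word
      sample_reward:    (B,)      metric score (e.g. CIDEr) of sampled captions
      greedy_reward:    (B,)      metric score of greedily decoded captions
    """
    # Word-level objective: standard maximum-likelihood loss.
    xe = F.nll_loss(word_log_probs.reshape(-1, word_log_probs.size(-1)),
                    targets.reshape(-1), ignore_index=pad_id)

    # Sentence-level objective: self-critical REINFORCE, with the greedy
    # caption's score as a baseline.
    advantage = (sample_reward - greedy_reward).detach()   # (B,)
    rl = -(advantage * sample_log_probs.sum(dim=1)).mean()

    # Mixed objective: optimize the metric directly while the XE term
    # keeps generated sentences fluent.
    return gamma * rl + (1.0 - gamma) * xe
```

Because the greedy caption serves as the baseline, only sampled captions that beat the model's own greedy decode receive a positive learning signal, which reduces the variance of the policy gradient.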

Bibliographic Details
Main Authors: Huanhou Xiao (ORCID: 0000-0002-5447-539X), Jinglun Shi
Affiliation: School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
Format: Article
Language: English
Published: IEEE, 2019-01-01
Series: IEEE Access, Vol. 7, pp. 135757-135769
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2019.2942000
Subjects: Video captioning; sequence-to-sequence; adaptive attention; reinforcement learning; long short-term memory
Online Access: https://ieeexplore.ieee.org/document/8840863/