Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms

Video description plays an important role in the field of intelligent imaging technology. Attention perception mechanisms are extensively applied in video description models based on deep learning. Most existing models use a temporal-spatial attention mechanism to enhance the accuracy of models. Tem...

Bibliographic Details
Main Authors: Jie Xu, Haoliang Wei, Linke Li, Qiuru Fu, Jinhong Guo
Format: Article
Language: English
Published: MDPI AG 2020-06-01
Series: Applied Sciences
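Subjects: intelligent imaging technology; deep learning; video description; multi-attention perception mechanism; consistency of visual features; visualization model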
Online Access: https://www.mdpi.com/2076-3417/10/12/4312
_version_ 1827714343520698368
author Jie Xu
Haoliang Wei
Linke Li
Qiuru Fu
Jinhong Guo
author_facet Jie Xu
Haoliang Wei
Linke Li
Qiuru Fu
Jinhong Guo
author_sort Jie Xu
collection DOAJ
description Video description plays an important role in the field of intelligent imaging technology. Attention perception mechanisms are extensively applied in video description models based on deep learning. Most existing models use a temporal-spatial attention mechanism to enhance the accuracy of models. Temporal attention mechanisms can obtain the global features of a video, whereas spatial attention mechanisms obtain local features. Nevertheless, because each channel of the convolutional neural network (CNN) feature maps has certain spatial semantic information, it is insufficient to merely divide the CNN features into regions and then apply a spatial attention mechanism. In this paper, we propose a temporal-spatial and channel attention mechanism that enables the model to take advantage of various video features and ensures the consistency of visual features between sentence descriptions to enhance the effect of the model. Meanwhile, in order to prove the effectiveness of the attention mechanism, this paper proposes a video visualization model based on the video description. Experimental results show that our model has achieved good performance on the Microsoft Video Description (MSVD) dataset and a certain improvement on the Microsoft Research-Video to Text (MSR-VTT) dataset.
first_indexed 2024-03-10T18:56:26Z
format Article
id doaj.art-868022c8a6d74973b9d62433a8aa4612
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-10T18:56:26Z
publishDate 2020-06-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-868022c8a6d74973b9d62433a8aa4612
2023-11-20T04:44:47Z
eng
MDPI AG
Applied Sciences
2076-3417
2020-06-01
10
12
4312
10.3390/app10124312
Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms
Jie Xu 0
Haoliang Wei 1
Linke Li 2
Qiuru Fu 3
Jinhong Guo 4
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Video description plays an important role in the field of intelligent imaging technology. Attention perception mechanisms are extensively applied in video description models based on deep learning. Most existing models use a temporal-spatial attention mechanism to enhance the accuracy of models. Temporal attention mechanisms can obtain the global features of a video, whereas spatial attention mechanisms obtain local features. Nevertheless, because each channel of the convolutional neural network (CNN) feature maps has certain spatial semantic information, it is insufficient to merely divide the CNN features into regions and then apply a spatial attention mechanism. In this paper, we propose a temporal-spatial and channel attention mechanism that enables the model to take advantage of various video features and ensures the consistency of visual features between sentence descriptions to enhance the effect of the model. Meanwhile, in order to prove the effectiveness of the attention mechanism, this paper proposes a video visualization model based on the video description. Experimental results show that our model has achieved good performance on the Microsoft Video Description (MSVD) dataset and a certain improvement on the Microsoft Research-Video to Text (MSR-VTT) dataset.
https://www.mdpi.com/2076-3417/10/12/4312
intelligent imaging technology
deep learning
video description
multi-attention perception mechanism
consistency of visual features
visualization model
spellingShingle Jie Xu
Haoliang Wei
Linke Li
Qiuru Fu
Jinhong Guo
Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms
Applied Sciences
intelligent imaging technology
deep learning
video description
multi-attention perception mechanism
consistency of visual features
visualization model
title Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms
title_full Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms
title_fullStr Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms
title_full_unstemmed Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms
title_short Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms
title_sort video description model based on temporal spatial and channel multi attention mechanisms
topic intelligent imaging technology
deep learning
video description
multi-attention perception mechanism
consistency of visual features
visualization model
url https://www.mdpi.com/2076-3417/10/12/4312
work_keys_str_mv AT jiexu videodescriptionmodelbasedontemporalspatialandchannelmultiattentionmechanisms
AT haoliangwei videodescriptionmodelbasedontemporalspatialandchannelmultiattentionmechanisms
AT linkeli videodescriptionmodelbasedontemporalspatialandchannelmultiattentionmechanisms
AT qiurufu videodescriptionmodelbasedontemporalspatialandchannelmultiattentionmechanisms
AT jinhongguo videodescriptionmodelbasedontemporalspatialandchannelmultiattentionmechanisms