Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms
Video description plays an important role in the field of intelligent imaging technology. Attention perception mechanisms are extensively applied in video description models based on deep learning. Most existing models use a temporal-spatial attention mechanism to enhance the accuracy of models. Tem...
Main Authors: | Jie Xu, Haoliang Wei, Linke Li, Qiuru Fu, Jinhong Guo |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2020-06-01 |
Series: | Applied Sciences |
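Subjects: | intelligent imaging technology; deep learning; video description; multi-attention perception mechanism; consistency of visual features; visualization model |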
Online Access: | https://www.mdpi.com/2076-3417/10/12/4312 |
_version_ | 1827714343520698368 |
---|---|
author | Jie Xu; Haoliang Wei; Linke Li; Qiuru Fu; Jinhong Guo |
author_facet | Jie Xu; Haoliang Wei; Linke Li; Qiuru Fu; Jinhong Guo |
author_sort | Jie Xu |
collection | DOAJ |
description | Video description plays an important role in the field of intelligent imaging technology. Attention perception mechanisms are extensively applied in video description models based on deep learning. Most existing models use a temporal-spatial attention mechanism to improve accuracy. Temporal attention mechanisms obtain the global features of a video, whereas spatial attention mechanisms obtain local features. However, because each channel of the convolutional neural network (CNN) feature maps carries its own spatial semantic information, it is insufficient to merely divide the CNN features into regions and then apply a spatial attention mechanism. In this paper, we propose a temporal-spatial and channel attention mechanism that enables the model to take advantage of various video features and ensures the consistency of visual features between sentence descriptions, improving the model's performance. To demonstrate the effectiveness of the attention mechanism, this paper also proposes a video visualization model based on the video description. Experimental results show that our model achieves good performance on the Microsoft Video Description (MSVD) dataset and some improvement on the Microsoft Research-Video to Text (MSR-VTT) dataset. |
first_indexed | 2024-03-10T18:56:26Z |
format | Article |
id | doaj.art-868022c8a6d74973b9d62433a8aa4612 |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-10T18:56:26Z |
publishDate | 2020-06-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-868022c8a6d74973b9d62433a8aa4612; 2023-11-20T04:44:47Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2020-06-01; 10; 12; 4312; 10.3390/app10124312; Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms; Jie Xu, Haoliang Wei, Linke Li, Qiuru Fu, Jinhong Guo (all: School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China); https://www.mdpi.com/2076-3417/10/12/4312; intelligent imaging technology; deep learning; video description; multi-attention perception mechanism; consistency of visual features; visualization model |
title | Video Description Model Based on Temporal-Spatial and Channel Multi-Attention Mechanisms |
topic | intelligent imaging technology; deep learning; video description; multi-attention perception mechanism; consistency of visual features; visualization model |
url | https://www.mdpi.com/2076-3417/10/12/4312 |