Modality attention fusion model with hybrid multi-head self-attention for video understanding
Video question answering (Video-QA) is a subject of intense study in Artificial Intelligence and one of the tasks by which such AI abilities can be evaluated. In this paper, we propose a Modality Attention Fusion framework with Hybrid Multi-head Self-attention (MAF-HMS). MAF-HMS focuses on th...
Main Authors: | Xuqiang Zhuang, Fang’ai Liu, Jian Hou, Jianhua Hao, Xiaohong Cai |
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2022-01-01 |
Series: | PLoS ONE |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9536548/?tool=EBI |
_version_ | 1811226780599582720 |
author | Xuqiang Zhuang Fang’ai Liu Jian Hou Jianhua Hao Xiaohong Cai |
author_facet | Xuqiang Zhuang Fang’ai Liu Jian Hou Jianhua Hao Xiaohong Cai |
author_sort | Xuqiang Zhuang |
collection | DOAJ |
description | Video question answering (Video-QA) is a subject of intense study in Artificial Intelligence and one of the tasks by which such AI abilities can be evaluated. In this paper, we propose a Modality Attention Fusion framework with Hybrid Multi-head Self-attention (MAF-HMS). MAF-HMS focuses on the task of answering multiple-choice questions about a video-subtitle-QA representation by fusing attention and self-attention across modalities. We use BERT to extract text features and Faster R-CNN to extract visual features, providing a useful input representation for our model to answer questions. In addition, we construct a Modality Attention Fusion (MAF) framework for the attention fusion matrix from the different modalities (video, subtitles, QA), and use a Hybrid Multi-head Self-attention (HMS) module to determine the correct answer. Experiments on three separate scene datasets show that our model outperforms the baseline methods by a large margin. Finally, we conduct extensive ablation studies to verify the components of the network and demonstrate the effectiveness and advantages of our method over existing methods through question-type and required-modality experiments. |
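The description mentions fusing modality features (video, subtitles, QA) and applying multi-head self-attention over the result. The following is a minimal illustrative sketch of that general pattern, not the authors' code: the feature shapes, random projection matrices, and the concatenation-based fusion are all assumptions made for the example.

```python
# Illustrative sketch (NOT the paper's implementation): scaled dot-product
# multi-head self-attention applied to a fused sequence of modality features.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    """x: (seq_len, d_model); wq/wk/wv/wo: (d_model, d_model) projections."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def project(w):
        # Project and split into heads: (num_heads, seq_len, d_head)
        return (x @ w).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = project(wq), project(wk), project(wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head attention
    out = softmax(scores) @ v                              # (heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model) # merge heads
    return out @ wo

# Hypothetical per-modality features, fused by simple concatenation
rng = np.random.default_rng(0)
d_model, heads = 8, 2
video = rng.normal(size=(3, d_model))
subs = rng.normal(size=(2, d_model))
qa = rng.normal(size=(4, d_model))
fused = np.concatenate([video, subs, qa], axis=0)          # (9, d_model)
ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
y = multi_head_self_attention(fused, *ws, num_heads=heads)
print(y.shape)  # (9, 8)
```

The paper's MAF module computes attention fusion matrices between modality pairs before the self-attention stage; this sketch only shows the generic self-attention step over the concatenated sequence.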
first_indexed | 2024-04-12T09:31:02Z |
format | Article |
id | doaj.art-2d7ac731f1a849549b44bcfa03d0d5f0 |
institution | Directory Open Access Journal |
issn | 1932-6203 |
language | English |
last_indexed | 2024-04-12T09:31:02Z |
publishDate | 2022-01-01 |
publisher | Public Library of Science (PLoS) |
record_format | Article |
series | PLoS ONE |
spellingShingle | Xuqiang Zhuang Fang’ai Liu Jian Hou Jianhua Hao Xiaohong Cai Modality attention fusion model with hybrid multi-head self-attention for video understanding PLoS ONE |
title | Modality attention fusion model with hybrid multi-head self-attention for video understanding |
title_full | Modality attention fusion model with hybrid multi-head self-attention for video understanding |
title_fullStr | Modality attention fusion model with hybrid multi-head self-attention for video understanding |
title_full_unstemmed | Modality attention fusion model with hybrid multi-head self-attention for video understanding |
title_short | Modality attention fusion model with hybrid multi-head self-attention for video understanding |
title_sort | modality attention fusion model with hybrid multi head self attention for video understanding |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9536548/?tool=EBI |
work_keys_str_mv | AT xuqiangzhuang modalityattentionfusionmodelwithhybridmultiheadselfattentionforvideounderstanding AT fangailiu modalityattentionfusionmodelwithhybridmultiheadselfattentionforvideounderstanding AT jianhou modalityattentionfusionmodelwithhybridmultiheadselfattentionforvideounderstanding AT jianhuahao modalityattentionfusionmodelwithhybridmultiheadselfattentionforvideounderstanding AT xiaohongcai modalityattentionfusionmodelwithhybridmultiheadselfattentionforvideounderstanding |