Modality attention fusion model with hybrid multi-head self-attention for video understanding

Video question answering (Video-QA) is a subject of intense study in artificial intelligence, as it is one of the tasks by which an AI system's abilities can be evaluated. In this paper, we propose a Modality Attention Fusion framework with Hybrid Multi-head Self-attention (MAF-HMS). MAF-HMS focuses on answering multiple-choice questions over a video-subtitle-QA representation by fusing attention and self-attention across modalities. We use BERT to extract text features and Faster R-CNN to extract visual features, providing a useful input representation for the model to answer questions. In addition, we construct a Modality Attention Fusion (MAF) framework that builds an attention fusion matrix across the different modalities (video, subtitles, QA), and apply Hybrid Multi-head Self-attention (HMS) to further determine the correct answer. Experiments on three separate scene datasets show that our model outperforms the baseline methods by a large margin. Finally, we conducted extensive ablation studies to verify the components of the network, and we demonstrate the effectiveness and advantages of our method over existing approaches through experiments on question types and required modalities.
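
This record carries only the abstract, but the pipeline it outlines (cross-modal attention fusion over video, subtitle, and QA feature streams, followed by multi-head self-attention and answer scoring) can be sketched. The PyTorch snippet below is a minimal illustration under assumed shapes and names: `ModalityAttentionFusion`, `MAFHMSSketch`, the 768-dimensional features, and the mean-pool scoring head are all hypothetical stand-ins, not the authors' released implementation of MAF-HMS.

```python
# Hypothetical sketch of the fusion stage described in the abstract: pairwise
# cross-modal attention (QA attending into video and into subtitles), followed
# by multi-head self-attention over the fused sequence. All names, dimensions,
# and the scoring head are illustrative assumptions.
import torch
import torch.nn as nn


class ModalityAttentionFusion(nn.Module):
    """Fuses one modality with another via cross-attention (assumed design)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_mod: torch.Tensor, context_mod: torch.Tensor) -> torch.Tensor:
        # Attend from one modality (e.g., QA tokens) into another (e.g., video).
        fused, _ = self.cross_attn(query_mod, context_mod, context_mod)
        return fused


class MAFHMSSketch(nn.Module):
    """Toy end-to-end fusion: cross-modal attention, then self-attention."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.qa_to_video = ModalityAttentionFusion(dim, num_heads)
        self.qa_to_subtitle = ModalityAttentionFusion(dim, num_heads)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.score = nn.Linear(dim, 1)  # one logit per candidate answer

    def forward(self, video, subtitle, qa):
        # video/subtitle/qa: (batch, seq_len, dim) feature sequences, e.g. from
        # Faster R-CNN region features and BERT token embeddings respectively.
        fused = torch.cat([self.qa_to_video(qa, video),
                           self.qa_to_subtitle(qa, subtitle)], dim=1)
        refined, _ = self.self_attn(fused, fused, fused)
        # Pool over the sequence and score this (question, answer) pair.
        return self.score(refined.mean(dim=1)).squeeze(-1)


# Usage: score a (question, candidate-answer) pair for a batch of two clips.
model = MAFHMSSketch()
video = torch.randn(2, 20, 768)     # 20 region features per clip
subtitle = torch.randn(2, 30, 768)  # 30 subtitle tokens
qa = torch.randn(2, 15, 768)        # question + one candidate answer
print(model(video, subtitle, qa).shape)  # torch.Size([2])
```

In practice, the text streams would come from BERT token embeddings and the video stream from Faster R-CNN region features, as the abstract states; each multiple-choice candidate is scored separately and the highest-scoring option is selected.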

Bibliographic Details
Main Authors: Xuqiang Zhuang, Fang’ai Liu, Jian Hou, Jianhua Hao, Xiaohong Cai
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2022-01-01
Series: PLoS ONE
ISSN: 1932-6203
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9536548/?tool=EBI