Multi-Shared Attention with Global and Local Pathways for Video Question Answering

Bibliographic Details
Main Author: WANG Lei-quan, HOU Wen-yan, YUAN Shao-zu, ZHAO Xin, LIN Yao, WU Chun-lei
Format: Article
Language: zho
Published: Editorial office of Computer Science 2021-08-01
Series: Jisuanji kexue
Subjects:
Online Access: http://www.jsjkx.com/fileup/1002-137X/PDF/1002-137X-2021-8-145.pdf
Description
Summary: Video question answering is a challenging task of significant importance for visual understanding. However, current visual question answering (VQA) methods mainly focus on a single static image, which differs from the sequential visual data encountered in the real world. In addition, due to the diversity of textual questions, the VideoQA task has to deal with various visual features to obtain the answers. This paper presents a multi-shared attention network that utilizes local and global frame-level visual information for video question answering (VideoQA). Specifically, a two-pathway model is proposed to capture the global and local frame-level features at different frame rates. The two pathways are fused through multi-shared attention, in which both share the same attention function. Extensive experiments are conducted on the Tianchi VideoQA dataset to validate the effectiveness of the proposed method.
ISSN: 1002-137X
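
Illustrative sketch: the summary describes two frame-level pathways sampled at different frame rates and fused through one shared attention function. The record gives no architectural details, so the following is only a minimal sketch of that general idea, not the authors' network. Every name here (`sample_frames`, `shared_attention`, `two_pathway_features`), the stride values, the pairing of dense sampling with the "local" pathway and sparse sampling with the "global" pathway, the use of scaled dot-product attention, and fusion by concatenation are assumptions made for illustration.

```python
import math


def sample_frames(frames, stride):
    """Keep every `stride`-th frame feature vector (hypothetical sampling rule)."""
    return frames[::stride]


def shared_attention(query, frames):
    """Scaled dot-product attention of one query vector over frame features.

    The same function is called by both pathways, which is the sense in
    which the attention is "shared" in this sketch (an assumption).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, frame)) / scale for frame in frames]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    # Weighted sum of frame features, one value per feature dimension.
    return [sum(w * frame[i] for w, frame in zip(weights, frames)) for i in range(dim)]


def two_pathway_features(question_vec, frames, local_stride=1, global_stride=4):
    """Attend over densely and sparsely sampled frames, then concatenate."""
    local_frames = sample_frames(frames, local_stride)    # assumed "local" pathway
    global_frames = sample_frames(frames, global_stride)  # assumed "global" pathway
    local_ctx = shared_attention(question_vec, local_frames)
    global_ctx = shared_attention(question_vec, global_frames)
    return local_ctx + global_ctx  # list concatenation as a simple fusion


# Toy data: 16 frames with 4-dimensional features and a 4-dimensional question vector.
frames = [[math.sin(t + d) for d in range(4)] for t in range(16)]
question = [1.0, 0.0, 0.0, 0.0]
fused = two_pathway_features(question, frames)
```

In this sketch the fused vector has twice the per-frame feature dimension, and a downstream answer classifier (not shown, and not described in the record) would consume it.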