TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering

Video question answering (VideoQA) is a typical task that integrates language and vision. The key for VideoQA is to extract relevant and effective visual information for answering a specific question. Information selection is believed to be necessary for this task due to the large amount of irreleva...

Bibliographic Details
Main Authors: Tian Wang, Boyao Hou, Jiakun Li, Peng Shi, Baochang Zhang, Hichem Snoussi
Format: Article
Language: English
Published: Wiley 2023-04-01
Series: Advanced Intelligent Systems
Subjects: attention mechanism; video question answering; visual question answering
Online Access: https://doi.org/10.1002/aisy.202200131
author Tian Wang
Boyao Hou
Jiakun Li
Peng Shi
Baochang Zhang
Hichem Snoussi
collection DOAJ
description Video question answering (VideoQA) is a representative task that integrates language and vision. The key to VideoQA is extracting the visual information that is relevant to a specific question. Because a video contains a large amount of irrelevant information, information selection is believed to be necessary for this task, and explicitly learning an attention model is a reasonable and effective way to perform that selection. Herein, a novel VideoQA model called Text‐Assisted Spatial and Temporal Attention Network (TASTA) is proposed, which shows the great potential of explicitly modeling attention. TASTA is designed to be simple, small, clean, and efficient, so that its performance can be clearly justified and the model can be easily extended. Its success stems mainly from two new strategies for making better use of textual information. Experimental results on TGIF‐QA, a large and highly representative dataset, show that TASTA significantly outperforms the state of the art, and ablation studies demonstrate the effectiveness of its key components.
first_indexed 2024-04-09T16:48:57Z
format Article
id doaj.art-4e68299cda3847eba4207434742bae91
institution Directory Open Access Journal
issn 2640-4567
language English
last_indexed 2024-04-09T16:48:57Z
publishDate 2023-04-01
publisher Wiley
record_format Article
series Advanced Intelligent Systems
spelling doaj.art-4e68299cda3847eba4207434742bae91
2023-04-22T02:52:33Z
eng
Wiley
Advanced Intelligent Systems
2640-4567
2023-04-01
Volume 5, Issue 4
10.1002/aisy.202200131
TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering
Tian Wang (Institute of Artificial Intelligence, Beihang University, Beijing 100083, China)
Boyao Hou (School of Automation Science and Electrical Engineering, Beihang University, Beijing 100083, China)
Jiakun Li (School of Automation Science and Electrical Engineering, Beihang University, Beijing 100083, China)
Peng Shi (College of Computer and Cyber Security, Fujian Normal University, Fuzhou, Fujian 350117, China)
Baochang Zhang (Institute of Artificial Intelligence, Beihang University, Beijing 100083, China)
Hichem Snoussi (Institute Charles Delaunay, University of Technology of Troyes, 10004 Troyes, France)
https://doi.org/10.1002/aisy.202200131
attention mechanism
video question answering
visual question answering
title TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering
topic attention mechanism
video question answering
visual question answering
url https://doi.org/10.1002/aisy.202200131