Co-Attention Network With Question Type for Visual Question Answering

Visual Question Answering (VQA) is a challenging multi-modal learning task, since it requires understanding both the visual and textual modalities simultaneously. The approaches used to represent images and questions in a fine-grained manner therefore play a key role in performance. To obtain fine-grained image and question representations, we develop a co-attention mechanism using an end-to-end deep network architecture that jointly learns the image and question features. Specifically, textual attention, implemented by a self-attention model, reduces unrelated information and extracts more discriminative features for the question-level representation, which is in turn used to guide visual attention. We also note that many existing works use complex models to extract feature representations but neglect high-level summary information such as the question type. Hence, we introduce the question type by directly concatenating it with the multi-modal joint representation to narrow down the candidate answer space. A new network architecture combining the proposed co-attention mechanism and the question type provides a unified model for VQA. Extensive experiments on two public datasets demonstrate the effectiveness of our model compared with several state-of-the-art approaches.
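The abstract outlines a pipeline in which a self-attention module distills the question into a compact representation, that representation guides attention over image region features, and the fused result is concatenated with the question type before answer classification. The PyTorch-style module below is a minimal sketch of that kind of pipeline under assumed design choices, not the paper's exact architecture: the GRU encoder, single-layer attention scorers, element-wise fusion, one-hot question type, and all names and dimensions (e.g. `CoAttentionVQA`, `hidden_dim=512`) are illustrative assumptions.

```python
# Minimal sketch of a co-attention + question-type VQA model as described in
# the abstract. Layer choices, dimensions, and the fusion operator are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttentionVQA(nn.Module):
    def __init__(self, vocab_size, num_question_types, num_answers,
                 word_dim=300, hidden_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, hidden_dim, batch_first=True)
        # Textual self-attention: scores each word to keep discriminative ones.
        self.txt_att = nn.Linear(hidden_dim, 1)
        # Question-guided visual attention over image region features.
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.vis_att = nn.Linear(hidden_dim, 1)
        # Classifier takes the fused representation concatenated with the
        # question-type one-hot vector to narrow the candidate answer space.
        self.classifier = nn.Linear(hidden_dim + num_question_types, num_answers)

    def forward(self, question_tokens, image_regions, question_type_onehot):
        # question_tokens: (B, T) word ids; image_regions: (B, R, img_dim)
        words, _ = self.gru(self.embed(question_tokens))        # (B, T, H)
        t_weights = F.softmax(self.txt_att(words), dim=1)       # (B, T, 1)
        q_vec = (t_weights * words).sum(dim=1)                  # (B, H)

        regions = torch.tanh(self.img_proj(image_regions))      # (B, R, H)
        # The attended question representation guides visual attention.
        v_scores = self.vis_att(regions * q_vec.unsqueeze(1))   # (B, R, 1)
        v_weights = F.softmax(v_scores, dim=1)
        v_vec = (v_weights * regions).sum(dim=1)                # (B, H)

        joint = q_vec * v_vec                                   # element-wise fusion
        joint = torch.cat([joint, question_type_onehot], dim=1)
        return self.classifier(joint)                           # answer logits
```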

Bibliographic Details
Main Authors: Chao Yang, Mengqi Jiang, Bin Jiang, Weixin Zhou, Keqin Li
Author Affiliations: Chao Yang, Mengqi Jiang, Bin Jiang, and Weixin Zhou: College of Computer Science and Electronic Engineering, Hunan University, Changsha, China; Keqin Li: Department of Computer Science, State University of New York, New Paltz, NY, USA
ORCID: Chao Yang: 0000-0001-8774-8115; Keqin Li: 0000-0001-5224-4048
Format: Article
Language: English
Published: IEEE, 2019-01-01
Series: IEEE Access, Vol. 7 (2019), pp. 40771-40781
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2019.2908035
Subjects: Co-attention; question type; self-attention; visual question answering
Online Access: https://ieeexplore.ieee.org/document/8676009/