Co-Attention Network With Question Type for Visual Question Answering
Visual Question Answering (VQA) is a challenging multi-modal learning task since it requires an understanding of both visual and textual modalities simultaneously. Therefore, the approaches used to represent images and questions in a fine-grained manner play a key role in performance.
Main Authors: | Chao Yang, Mengqi Jiang, Bin Jiang, Weixin Zhou, Keqin Li |
Format: | Article |
Language: | English |
Published: | IEEE, 2019-01-01 |
Series: | IEEE Access |
Subjects: | Co-attention; question type; self-attention; visual question answering |
Online Access: | https://ieeexplore.ieee.org/document/8676009/ |
_version_ | 1818947654597476352 |
author | Chao Yang; Mengqi Jiang; Bin Jiang; Weixin Zhou; Keqin Li |
author_facet | Chao Yang; Mengqi Jiang; Bin Jiang; Weixin Zhou; Keqin Li |
author_sort | Chao Yang |
collection | DOAJ |
description | Visual Question Answering (VQA) is a challenging multi-modal learning task since it requires an understanding of both visual and textual modalities simultaneously. Therefore, the approaches used to represent images and questions in a fine-grained manner play a key role in performance. In order to obtain fine-grained image and question representations, we develop a co-attention mechanism using an end-to-end deep network architecture to jointly learn both the image and the question features. Specifically, textual attention implemented by a self-attention model reduces unrelated information and extracts more discriminative features for the question-level representation, which is in turn used to guide visual attention. We also note that many existing works use complex models to extract feature representations but neglect high-level summary information, such as the question type, during learning. Hence, we introduce the question type in our work by directly concatenating it with the multi-modal joint representation to narrow down the candidate answer space. A new network architecture combining the proposed co-attention mechanism and question type provides a unified model for VQA. Extensive experiments on two public datasets demonstrate the effectiveness of our model as compared with several state-of-the-art approaches. |
first_indexed | 2024-12-20T08:34:21Z |
format | Article |
id | doaj.art-95f8ec35e2154a4b8218145b328bf005 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-20T08:34:21Z |
publishDate | 2019-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-95f8ec35e2154a4b8218145b328bf005 (2022-12-21T19:46:36Z, eng). IEEE, IEEE Access, ISSN 2169-3536, 2019-01-01, vol. 7, pp. 40771-40781, DOI 10.1109/ACCESS.2019.2908035, article 8676009. Co-Attention Network With Question Type for Visual Question Answering. Chao Yang (https://orcid.org/0000-0001-8774-8115), Mengqi Jiang, Bin Jiang, Weixin Zhou: College of Computer Science and Electronic Engineering, Hunan University, Changsha, China; Keqin Li (https://orcid.org/0000-0001-5224-4048): Department of Computer Science, State University of New York, New Paltz, NY, USA. Abstract: Visual Question Answering (VQA) is a challenging multi-modal learning task since it requires an understanding of both visual and textual modalities simultaneously. Therefore, the approaches used to represent images and questions in a fine-grained manner play a key role in performance. In order to obtain fine-grained image and question representations, we develop a co-attention mechanism using an end-to-end deep network architecture to jointly learn both the image and the question features. Specifically, textual attention implemented by a self-attention model reduces unrelated information and extracts more discriminative features for the question-level representation, which is in turn used to guide visual attention. We also note that many existing works use complex models to extract feature representations but neglect high-level summary information, such as the question type, during learning. Hence, we introduce the question type in our work by directly concatenating it with the multi-modal joint representation to narrow down the candidate answer space. A new network architecture combining the proposed co-attention mechanism and question type provides a unified model for VQA. Extensive experiments on two public datasets demonstrate the effectiveness of our model as compared with several state-of-the-art approaches. https://ieeexplore.ieee.org/document/8676009/ Keywords: Co-attention; question type; self-attention; visual question answering |
spellingShingle | Chao Yang; Mengqi Jiang; Bin Jiang; Weixin Zhou; Keqin Li; Co-Attention Network With Question Type for Visual Question Answering; IEEE Access; Co-attention; question type; self-attention; visual question answering |
title | Co-Attention Network With Question Type for Visual Question Answering |
title_full | Co-Attention Network With Question Type for Visual Question Answering |
title_fullStr | Co-Attention Network With Question Type for Visual Question Answering |
title_full_unstemmed | Co-Attention Network With Question Type for Visual Question Answering |
title_short | Co-Attention Network With Question Type for Visual Question Answering |
title_sort | co attention network with question type for visual question answering |
topic | Co-attention; question type; self-attention; visual question answering |
url | https://ieeexplore.ieee.org/document/8676009/ |
work_keys_str_mv | AT chaoyang coattentionnetworkwithquestiontypeforvisualquestionanswering AT mengqijiang coattentionnetworkwithquestiontypeforvisualquestionanswering AT binjiang coattentionnetworkwithquestiontypeforvisualquestionanswering AT weixinzhou coattentionnetworkwithquestiontypeforvisualquestionanswering AT keqinli coattentionnetworkwithquestiontypeforvisualquestionanswering |
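The description field above outlines the model only at a high level: a self-attention module summarizes the question, that summary guides attention over image regions, and the question type is concatenated with the multi-modal joint representation before answer classification. The PyTorch sketch below is merely an illustration of that pipeline under assumed feature dimensions and a simple element-wise fusion; every layer name, size, and default count here (e.g., num_qtypes, num_answers) is a placeholder and not the authors' implementation, which is described in the article at the URL above.

```python
# Minimal, illustrative sketch of a co-attention VQA model with a question-type
# input, assuming PyTorch. Dimensions, fusion, and defaults are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttentionWithQuestionType(nn.Module):
    """Question self-attention guides visual attention; the fused feature is
    concatenated with a one-hot question-type vector before classification."""
    def __init__(self, img_dim=2048, q_dim=1024, hidden=512,
                 num_qtypes=65, num_answers=3000):
        super().__init__()
        # textual self-attention over question word features
        self.q_att = nn.Linear(q_dim, 1)
        # visual attention conditioned on the attended question feature
        self.v_proj = nn.Linear(img_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.v_att = nn.Linear(hidden, 1)
        # fusion + classifier over candidate answers
        self.fuse_v = nn.Linear(img_dim, hidden)
        self.fuse_q = nn.Linear(q_dim, hidden)
        self.classifier = nn.Linear(hidden + num_qtypes, num_answers)

    def forward(self, v, q, q_type):
        # v: (B, R, img_dim) region features; q: (B, T, q_dim) word features
        # q_type: (B, num_qtypes) one-hot question-type vector
        a_q = F.softmax(self.q_att(q), dim=1)            # (B, T, 1) word weights
        q_vec = (a_q * q).sum(dim=1)                      # attended question feature
        joint = torch.tanh(self.v_proj(v) + self.q_proj(q_vec).unsqueeze(1))
        a_v = F.softmax(self.v_att(joint), dim=1)         # (B, R, 1) region weights
        v_vec = (a_v * v).sum(dim=1)                      # attended image feature
        fused = torch.tanh(self.fuse_v(v_vec)) * torch.tanh(self.fuse_q(q_vec))
        # concatenate the question type to narrow the candidate answer space
        return self.classifier(torch.cat([fused, q_type], dim=1))
```

A forward pass in this sketch would take pre-extracted region features (e.g., from a CNN or object detector), question word embeddings, and a one-hot question-type vector; the actual feature extractors, fusion operator, and answer vocabulary used by the authors should be taken from the article itself.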