Object sequences: encoding categorical and spatial information for a yes/no visual question answering task
The task of visual question answering (VQA) has gained wide popularity in recent times. Effectively solving the VQA task requires the understanding of both the visual content in the image and the language information associated with the text‐based question. In this study, the authors propose a novel...
Main Authors: | Shivam Garg, Rajeev Srivastava |
---|---|
Format: | Article |
Language: | English |
Published: | Wiley, 2018-12-01 |
Series: | IET Computer Vision |
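Subjects: | object sequences; spatial object information encoding; categorical object information encoding; yes-no visual question answering task; VQA task; language information |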
Online Access: | https://doi.org/10.1049/iet-cvi.2018.5226 |
_version_ | 1827816870289342464 |
---|---|
author | Shivam Garg; Rajeev Srivastava |
collection | DOAJ |
description | Visual question answering (VQA) has gained wide popularity in recent years. Solving the VQA task effectively requires understanding both the visual content of the image and the language of the text-based question. In this study, the authors propose a novel method of encoding the visual information (categorical and spatial object information) of all the objects present in the image into a sequential format, called an object sequence. These object sequences can then be processed by a neural network. The authors experiment with multiple techniques for obtaining a joint embedding from the visual features (in the form of object sequences) and the language-based features obtained from the question. They also provide a detailed analysis of the performance of a neural network architecture using object sequences on the Oracle task of the GuessWhat dataset (a yes/no VQA task) and benchmark it against the baseline. |
first_indexed | 2024-03-12T00:26:30Z |
format | Article |
id | doaj.art-aef30dee609c44618234162e2e61722d |
institution | Directory of Open Access Journals |
issn | 1751-9632 1751-9640 |
language | English |
last_indexed | 2024-03-12T00:26:30Z |
publishDate | 2018-12-01 |
publisher | Wiley |
record_format | Article |
series | IET Computer Vision |
spelling | doaj.art-aef30dee609c44618234162e2e61722d; 2023-09-15T10:32:11Z; eng; Wiley; IET Computer Vision; 1751-9632; 1751-9640; 2018-12-01; 12; 8; 1141; 1150; 10.1049/iet-cvi.2018.5226; Object sequences: encoding categorical and spatial information for a yes/no visual question answering task; Shivam Garg (0); Rajeev Srivastava (1); (0) Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi 221005, UP, India; (1) Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi 221005, UP, India; https://doi.org/10.1049/iet-cvi.2018.5226; object sequences; spatial object information encoding; categorical object information encoding; yes-no visual question answering task; VQA task; language information |
title | Object sequences: encoding categorical and spatial information for a yes/no visual question answering task |
topic | object sequences; spatial object information encoding; categorical object information encoding; yes-no visual question answering task; VQA task; language information |
url | https://doi.org/10.1049/iet-cvi.2018.5226 |