Object sequences: encoding categorical and spatial information for a yes/no visual question answering task

The task of visual question answering (VQA) has gained wide popularity in recent times. Effectively solving the VQA task requires the understanding of both the visual content in the image and the language information associated with the text‐based question. In this study, the authors propose a novel...

Bibliographic Details
Main Authors: Shivam Garg, Rajeev Srivastava
Format: Article
Language: English
Published: Wiley 2018-12-01
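Series: IET Computer Vision
Subjects: object sequences; spatial object information encoding; categorical object information encoding; yes-no visual question answering task; VQA task; language information
Online Access: https://doi.org/10.1049/iet-cvi.2018.5226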
_version_ 1827816870289342464
author Shivam Garg
Rajeev Srivastava
collection DOAJ
description The task of visual question answering (VQA) has gained wide popularity in recent times. Effectively solving the VQA task requires the understanding of both the visual content in the image and the language information associated with the text‐based question. In this study, the authors propose a novel method of encoding the visual information (categorical and spatial object information) of all the objects present in the image into a sequential format, called an object sequence. These object sequences can then be suitably processed by a neural network. They experiment with multiple techniques for obtaining a joint embedding from the visual features (in the form of object sequences) and language‐based features obtained from the question. They also provide a detailed analysis of the performance of a neural network architecture that uses object sequences on the Oracle task of the GuessWhat dataset (a yes/no VQA task), and benchmark it against the baseline.
first_indexed 2024-03-12T00:26:30Z
format Article
id doaj.art-aef30dee609c44618234162e2e61722d
institution Directory Open Access Journal
issn 1751-9632
1751-9640
language English
last_indexed 2024-03-12T00:26:30Z
publishDate 2018-12-01
publisher Wiley
record_format Article
series IET Computer Vision
spelling doaj.art-aef30dee609c44618234162e2e61722d; 2023-09-15T10:32:11Z; eng; Wiley; IET Computer Vision; ISSN 1751-9632, 1751-9640; 2018-12-01; vol. 12, no. 8, pp. 1141-1150; 10.1049/iet-cvi.2018.5226; Object sequences: encoding categorical and spatial information for a yes/no visual question answering task; Shivam Garg, Rajeev Srivastava (both: Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi 221005, UP, India); https://doi.org/10.1049/iet-cvi.2018.5226
title Object sequences: encoding categorical and spatial information for a yes/no visual question answering task
topic object sequences
spatial object information encoding
categorical object information encoding
yes-no visual question answering task
VQA task
language information
url https://doi.org/10.1049/iet-cvi.2018.5226