Semi-Supervised Implicit Augmentation for Data-Scarce VQA


Bibliographic Details
Main Authors: Bhargav Dodla, Kartik Hegde, A. N. Rajagopalan
Format: Article
Language: English
Published: MDPI AG, 2024-02-01
Series: Computer Sciences & Mathematics Forum
Subjects: visual question answering; vision-language models; semi-supervised augmentation
Online Access: https://www.mdpi.com/2813-0324/9/1/3
author Bhargav Dodla
Kartik Hegde
A. N. Rajagopalan
collection DOAJ
description Vision-language models (VLMs) have demonstrated increasing potency in solving complex vision-language tasks in recent years. Visual question answering (VQA) is one of the primary downstream tasks for assessing the capability of VLMs, as it gauges the multimodal understanding of a VLM in answering open-ended questions. The vast contextual information learned during the pretraining stage of VLMs can be utilised effectively to finetune a VQA model for specific datasets. In particular, special types of VQA datasets, such as OK-VQA and A-OKVQA (outside-knowledge-based) and ArtVQA (domain-specific), have a relatively small number of images and corresponding question-answer annotations in the training set. Such datasets can be categorised as data-scarce, and this scarcity hinders effective learning by VLMs. We introduce SemIAug (Semi-Supervised Implicit Augmentation), a model- and dataset-agnostic strategy designed to address the challenges posed by limited data availability in domain-specific VQA datasets. SemIAug uses the annotated image-question data already present within the chosen dataset and augments it with meaningful new image-question associations. We show that SemIAug improves VQA performance on data-scarce datasets without the need for additional data or labels.
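Note: the abstract does not spell out how the new image-question associations are formed. Purely as an illustration of the general idea, the sketch below assumes that existing questions are re-associated with visually similar images from the same dataset, with the model's own predictions later supplying pseudo-answers (the semi-supervised step). All function names, embeddings, and thresholds here are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: re-associate existing questions with visually similar
# images in the same dataset (no external data or labels required).
# Embeddings, thresholds, and the pseudo-labelling step are assumptions for
# illustration only, not the method described in the paper.
import numpy as np

def augment_pairs(image_embs, annotations, sim_threshold=0.85, top_k=1):
    """image_embs: (N, D) array of L2-normalised image features.
    annotations: list of (image_idx, question, answer) triples.
    Returns new (image_idx, question) pairs to be pseudo-labelled later."""
    sims = image_embs @ image_embs.T      # cosine similarity between images
    np.fill_diagonal(sims, -1.0)          # exclude self-matches
    new_pairs = []
    for img_idx, question, _ in annotations:
        # take the most similar other images, keep those above the threshold
        neighbours = np.argsort(sims[img_idx])[::-1][:top_k]
        for nb in neighbours:
            if sims[img_idx, nb] >= sim_threshold:
                new_pairs.append((int(nb), question))
    return new_pairs

# Usage (hypothetical): pseudo-answers for the new pairs would then come from
# the VQA model itself before the augmented set is used for finetuning, e.g.
#   answer = vqa_model(images[nb], question)
```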
format Article
id doaj.art-66185707026647499119669528d5e3f6
institution Directory Open Access Journal
issn 2813-0324
language English
publishDate 2024-02-01
publisher MDPI AG
record_format Article
series Computer Sciences & Mathematics Forum
spelling Semi-Supervised Implicit Augmentation for Data-Scarce VQA / Bhargav Dodla, Kartik Hegde, A. N. Rajagopalan (Indian Institute of Technology, Madras 600036, India). Computer Sciences & Mathematics Forum, vol. 9, no. 1, article 3, 2024-02-01. MDPI AG. ISSN 2813-0324. doi:10.3390/cmsf2024009003. https://www.mdpi.com/2813-0324/9/1/3
title Semi-Supervised Implicit Augmentation for Data-Scarce VQA
topic visual question answering
vision-language models
semi-supervised augmentation
url https://www.mdpi.com/2813-0324/9/1/3