Semi-Supervised Implicit Augmentation for Data-Scarce VQA


Bibliographic Details
Main Authors: Bhargav Dodla, Kartik Hegde, A. N. Rajagopalan
Format: Article
Language: English
Published: MDPI AG, 2024-02-01
Series: Computer Sciences & Mathematics Forum
Subjects: visual question answering; vision-language models; semi-supervised augmentation
Online Access: https://www.mdpi.com/2813-0324/9/1/3
author Bhargav Dodla
Kartik Hegde
A. N. Rajagopalan
collection DOAJ
description Vision-language models (VLMs) have demonstrated increasing potency in solving complex vision-language tasks in recent years. Visual question answering (VQA) is one of the primary downstream tasks for assessing the capability of VLMs, as it gauges the multimodal understanding of a VLM in answering open-ended questions. The vast contextual information learned during the pretraining stage of VLMs can be utilised effectively to finetune a VQA model for specific datasets. In particular, special types of VQA datasets, such as OK-VQA and A-OKVQA (outside-knowledge-based) and ArtVQA (domain-specific), have a relatively small number of images and corresponding question-answer annotations in the training set. Such datasets can be categorised as data-scarce, and this scarcity hinders effective learning by VLMs. We introduce SemIAug (Semi-Supervised Implicit Augmentation), a model- and dataset-agnostic strategy designed to address the challenges posed by limited data availability in domain-specific VQA datasets. SemIAug uses the annotated image-question data already present within the chosen dataset and augments it with meaningful new image-question associations. We show that SemIAug improves VQA performance on data-scarce datasets without the need for additional data or labels.
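Note: the abstract does not spell out how the new image-question associations are formed. Purely as an illustration of the general idea, the sketch below assumes that existing questions are re-associated with visually similar images from the same dataset, with the model's own predictions later supplying pseudo-answers (the semi-supervised step). All function names, embeddings, and thresholds here are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: re-associate existing questions with visually similar
# images in the same dataset (no external data or labels required).
# Embeddings, thresholds, and the pseudo-labelling step are assumptions for
# illustration only, not the method described in the paper.
import numpy as np

def augment_pairs(image_embs, annotations, sim_threshold=0.85, top_k=1):
    """image_embs: (N, D) array of L2-normalised image features.
    annotations: list of (image_idx, question, answer) triples.
    Returns new (image_idx, question) pairs to be pseudo-labelled later."""
    sims = image_embs @ image_embs.T      # cosine similarity between images
    np.fill_diagonal(sims, -1.0)          # exclude self-matches
    new_pairs = []
    for img_idx, question, _ in annotations:
        # take the most similar other images, keep those above the threshold
        neighbours = np.argsort(sims[img_idx])[::-1][:top_k]
        for nb in neighbours:
            if sims[img_idx, nb] >= sim_threshold:
                new_pairs.append((int(nb), question))
    return new_pairs

# Usage (hypothetical): pseudo-answers for the new pairs would then come from
# the VQA model itself before the augmented set is used for finetuning, e.g.
#   answer = vqa_model(images[nb], question)
```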
format Article
id doaj.art-66185707026647499119669528d5e3f6
institution Directory Open Access Journal
issn 2813-0324
language English
publishDate 2024-02-01
publisher MDPI AG
record_format Article
series Computer Sciences & Mathematics Forum
spelling Semi-Supervised Implicit Augmentation for Data-Scarce VQA / Bhargav Dodla, Kartik Hegde, A. N. Rajagopalan (Indian Institute of Technology, Madras 600036, India). Computer Sciences & Mathematics Forum, vol. 9, no. 1, article 3, 2024-02-01. MDPI AG. ISSN 2813-0324. doi:10.3390/cmsf2024009003. https://www.mdpi.com/2813-0324/9/1/3
title Semi-Supervised Implicit Augmentation for Data-Scarce VQA
topic visual question answering
vision-language models
semi-supervised augmentation
url https://www.mdpi.com/2813-0324/9/1/3