Semi-Supervised Implicit Augmentation for Data-Scarce VQA
Vision-language models (VLMs) have become increasingly capable of solving complex vision-language tasks in recent years. Visual question answering (VQA) is one of the primary downstream tasks for assessing the capability of VLMs, as it helps gauge the multimodal understanding of a VLM when answering open-ended questions.
Main Authors: | Bhargav Dodla, Kartik Hegde, A. N. Rajagopalan
---|---
Format: | Article
Language: | English
Published: | MDPI AG, 2024-02-01
Series: | Computer Sciences & Mathematics Forum
Subjects: | visual question answering; vision-language models; semi-supervised augmentation
Online Access: | https://www.mdpi.com/2813-0324/9/1/3
_version_ | 1797241517332496384 |
author | Bhargav Dodla; Kartik Hegde; A. N. Rajagopalan
author_facet | Bhargav Dodla; Kartik Hegde; A. N. Rajagopalan
author_sort | Bhargav Dodla |
collection | DOAJ |
description | Vision-language models (VLMs) have become increasingly capable of solving complex vision-language tasks in recent years. Visual question answering (VQA) is one of the primary downstream tasks for assessing the capability of VLMs, as it helps gauge the multimodal understanding of a VLM when answering open-ended questions. The vast contextual information learned during the pretraining stage of VLMs can be utilised effectively to finetune a VQA model for specific datasets. In particular, special types of VQA datasets, such as OK-VQA and A-OKVQA (outside-knowledge-based) and ArtVQA (domain-specific), have relatively few images and corresponding question-answer annotations in the training set. Such datasets can be categorised as data-scarce, and the low information availability hinders effective learning in VLMs. We introduce SemIAug (**Sem**i-Supervised **I**mplicit **Aug**mentation), a model- and dataset-agnostic strategy designed specifically to address the challenges posed by limited data availability in domain-specific VQA datasets. SemIAug uses the annotated image-question data already present within the chosen dataset and augments it with meaningful new image-question associations. We show that SemIAug improves VQA performance on data-scarce datasets without the need for additional data or labels.
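The abstract above describes SemIAug only at a high level: reuse the dataset's own annotated image-question pairs and create meaningful new image-question associations. A minimal sketch of that idea follows, assuming (since the record gives no details) that new associations are proposed via cosine similarity between image embeddings and that answers for the new pairs would later be pseudo-labelled by the base VQA model; all names and the threshold are illustrative, not taken from the paper.

```python
"""Hypothetical sketch of SemIAug-style implicit augmentation.

Assumption (not from the record): a question written for one image is
re-attached to the most visually similar *other* image in the dataset,
measured by cosine similarity of precomputed image embeddings.
"""
import numpy as np

def augment_pairs(image_embs: np.ndarray, owner: np.ndarray,
                  threshold: float = 0.8) -> list[tuple[int, int]]:
    """Return proposed (new_image_idx, question_idx) pairs.

    owner[q] is the index of the image that question q was annotated on.
    """
    # L2-normalise so that dot products become cosine similarities.
    embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = embs @ embs.T
    np.fill_diagonal(sims, -1.0)  # never re-pair a question with its own image
    pairs = []
    for q_idx, img_idx in enumerate(owner):
        best = int(np.argmax(sims[img_idx]))
        if sims[img_idx, best] >= threshold:
            # Answers for these new pairs are not annotated; in a
            # semi-supervised setup they would be pseudo-labelled by
            # the base VQA model before finetuning.
            pairs.append((best, q_idx))
    return pairs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dummy_embs = rng.normal(size=(100, 512))  # stand-in image features
    owner = np.arange(100)                    # question i annotated on image i
    print(len(augment_pairs(dummy_embs, owner, threshold=0.0)),
          "proposed image-question pairs")
```

With random embeddings the demo only exercises the mechanics; in practice the embeddings would come from the VLM's image encoder, and the similarity threshold would trade augmentation volume against the quality of the new image-question associations.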
first_indexed | 2024-04-24T18:24:35Z |
format | Article |
id | doaj.art-66185707026647499119669528d5e3f6 |
institution | Directory Open Access Journal |
issn | 2813-0324 |
language | English |
last_indexed | 2024-04-24T18:24:35Z |
publishDate | 2024-02-01 |
publisher | MDPI AG |
record_format | Article |
series | Computer Sciences & Mathematics Forum |
spelling | doaj.art-66185707026647499119669528d5e3f6 (indexed 2024-03-27T13:32:36Z); eng; MDPI AG; Computer Sciences & Mathematics Forum; ISSN 2813-0324; published 2024-02-01; vol. 9, iss. 1, art. 3; doi:10.3390/cmsf2024009003; "Semi-Supervised Implicit Augmentation for Data-Scarce VQA"; Bhargav Dodla, Kartik Hegde, A. N. Rajagopalan (all: Indian Institute of Technology, Madras 600036, India); abstract as given in the description field above; https://www.mdpi.com/2813-0324/9/1/3; topics: visual question answering; vision-language models; semi-supervised augmentation
spellingShingle | Bhargav Dodla; Kartik Hegde; A. N. Rajagopalan; Semi-Supervised Implicit Augmentation for Data-Scarce VQA; Computer Sciences & Mathematics Forum; visual question answering; vision-language models; semi-supervised augmentation
title | Semi-Supervised Implicit Augmentation for Data-Scarce VQA |
title_full | Semi-Supervised Implicit Augmentation for Data-Scarce VQA |
title_fullStr | Semi-Supervised Implicit Augmentation for Data-Scarce VQA |
title_full_unstemmed | Semi-Supervised Implicit Augmentation for Data-Scarce VQA |
title_short | Semi-Supervised Implicit Augmentation for Data-Scarce VQA |
title_sort | semi supervised implicit augmentation for data scarce vqa |
topic | visual question answering; vision-language models; semi-supervised augmentation
url | https://www.mdpi.com/2813-0324/9/1/3 |
work_keys_str_mv | AT bhargavdodla semisupervisedimplicitaugmentationfordatascarcevqa AT kartikhegde semisupervisedimplicitaugmentationfordatascarcevqa AT anrajagopalan semisupervisedimplicitaugmentationfordatascarcevqa |