Learning deep networks for image classification

Bibliographic Details
Main Author: Zhou, Yixuan
Other Authors: Hanwang Zhang
Format: Final Year Project (FYP)
Language: English
Published: Nanyang Technological University, 2024
Online Access:https://hdl.handle.net/10356/175074
Description
Summary: Visual Question Answering stands at the intersection of computer vision and natural language processing, bridging the semantic gap between visual information and textual queries. The dominant approach to this complex task, end-to-end modeling, does not separate visual processing from reasoning, which constrains both interpretability and generalization. Modular program learning has emerged as a promising alternative, although it is intricate to implement because the modules and the programs that compose them must be learned simultaneously. This project introduces VQA-GPT, a framework that employs code-generation models and the Python interpreter to compose vision-and-language modules into answers for textual queries. This zero-shot method outperforms traditional end-to-end models on a range of complex visual tasks.
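
The abstract describes a pipeline in which a code-generation model writes a short program that composes vision-and-language modules, and the ordinary Python interpreter executes that program to answer the question. The minimal sketch below illustrates that general pattern only; every name in it (llm_generate_program, detect_objects, caption_region, answer_question) is a hypothetical stand-in, not the project's actual API, and the module stubs return fixed dummy values.

from typing import Dict, List

def detect_objects(image, category: str) -> List[dict]:
    # Hypothetical vision module: a real system would run an
    # open-vocabulary detector here; this stub returns one dummy region.
    return [{"box": (0, 0, 10, 10), "label": category}]

def caption_region(image, region: dict) -> str:
    # Hypothetical vision-and-language module: a real system would
    # caption the cropped region; this stub returns a fixed string.
    return f"a gray {region['label']}"

def llm_generate_program(question: str) -> str:
    # Hypothetical call to a code-generation model. For a question such
    # as "What color is the cat?" it might emit the program below.
    return (
        "regions = detect_objects(image, 'cat')\n"
        "answer = caption_region(image, regions[0]) if regions else 'no cat found'\n"
    )

def answer_question(image, question: str) -> str:
    # Compose the modules by executing the generated program with the
    # standard Python interpreter, then read back the `answer` variable.
    program = llm_generate_program(question)
    namespace: Dict = {
        "image": image,
        "detect_objects": detect_objects,
        "caption_region": caption_region,
    }
    exec(program, namespace)
    return namespace["answer"]

print(answer_question(image=None, question="What color is the cat?"))
# -> "a gray cat"

Because the composition is carried out by the interpreter rather than by a jointly trained controller, no module or program has to be learned end to end, which is the sense in which the abstract describes the method as zero-shot.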