Learning deep networks for image classification

Visual Question Answering sits at the intersection of computer vision and natural language processing, bridging the semantic gap between visual information and textual queries. The dominant approach to this complex task, end-to-end modeling, does not distinguish between visual processi...


Bibliographic Details
Main Author:	Zhou, Yixuan
Other Authors:	Hanwang Zhang
Format:	Final Year Project (FYP)
Language:	English
Published:	Nanyang Technological University 2024
Subjects:	Computer and Information Science
Online Access:	https://hdl.handle.net/10356/175074

_version_	1811687206763364352
author	Zhou, Yixuan
author2	Hanwang Zhang
author_facet	Hanwang Zhang Zhou, Yixuan
author_sort	Zhou, Yixuan
collection	NTU
description	Visual Question Answering sits at the intersection of computer vision and natural language processing, bridging the semantic gap between visual information and textual queries. The dominant approach to this complex task, end-to-end modeling, does not distinguish between visual processing and reasoning, which limits both interpretability and generalization. Modular program learning is a promising alternative, but it is difficult to implement because the modules and the programs must be learned simultaneously. This project introduces VQA-GPT, a framework that uses code generation models and the Python interpreter to compose vision-and-language modules that answer textual queries. This zero-shot method outperforms traditional end-to-end models on a variety of complex visual tasks.
first_indexed	2024-10-01T05:12:38Z
format	Final Year Project (FYP)
id	ntu-10356/175074
institution	Nanyang Technological University
language	English
last_indexed	2024-10-01T05:12:38Z
publishDate	2024
publisher	Nanyang Technological University
record_format	dspace
spelling	ntu-10356/1750742024-04-19T15:46:03Z Learning deep networks for image classification Zhou, Yixuan Hanwang Zhang School of Computer Science and Engineering hanwangzhang@ntu.edu.sg Computer and Information Science Visual Question Answering sits at the intersection of computer vision and natural language processing, bridging the semantic gap between visual information and textual queries. The dominant approach to this complex task, end-to-end modeling, does not distinguish between visual processing and reasoning, which limits both interpretability and generalization. Modular program learning is a promising alternative, but it is difficult to implement because the modules and the programs must be learned simultaneously. This project introduces VQA-GPT, a framework that uses code generation models and the Python interpreter to compose vision-and-language modules that answer textual queries. This zero-shot method outperforms traditional end-to-end models on a variety of complex visual tasks. Bachelor's degree 2024-04-19T04:01:42Z 2024-04-19T04:01:42Z 2024 Final Year Project (FYP) Zhou, Y. (2024). Learning deep networks for image classification. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175074 https://hdl.handle.net/10356/175074 en SCSE23-0210 application/pdf Nanyang Technological University
spellingShingle	Computer and Information Science Zhou, Yixuan Learning deep networks for image classification
title	Learning deep networks for image classification
title_full	Learning deep networks for image classification
title_fullStr	Learning deep networks for image classification
title_full_unstemmed	Learning deep networks for image classification
title_short	Learning deep networks for image classification
title_sort	learning deep networks for image classification
topic	Computer and Information Science
url	https://hdl.handle.net/10356/175074
work_keys_str_mv	AT zhouyixuan learningdeepnetworksforimageclassification
