Learning deep networks for image classification

Bibliographic Details
Main Author: Zhou, Yixuan
Other Authors: Hanwang Zhang
Format: Final Year Project (FYP)
Language: English
Published: Nanyang Technological University 2024
Subjects: Computer and Information Science
Online Access: https://hdl.handle.net/10356/175074
author Zhou, Yixuan
author2 Hanwang Zhang
collection NTU
description Visual Question Answering stands at the intersection of computer vision and natural language processing, bridging the semantic gap between visual information and textual queries. The dominant approach to this complex task, end-to-end modeling, does not separate visual processing from reasoning, which limits both interpretability and generalization. Modular program learning is a promising alternative, but it is difficult to implement because the modules and the programs must be learned simultaneously. This project introduces VQA-GPT, a framework that uses code generation models and the Python interpreter to compose vision-and-language modules into programs that answer textual queries. This zero-shot method outperforms traditional end-to-end models in solving various complex visual tasks.
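The composition step described above can be illustrated with a minimal sketch. It is not the VQA-GPT implementation: the names find_objects, generate_program and run_query are hypothetical, the "image" is a toy list of labels, and the code generation model is replaced by a hard-coded lookup. It only shows the general pattern of executing generated Python source against a small module API.

```python
# Hypothetical sketch: every name here is an illustrative assumption,
# not part of the VQA-GPT framework described in the record.
from typing import Dict, List

# Toy "image": a list of object labels standing in for real pixels.
Image = List[str]


def find_objects(image: Image, name: str) -> List[str]:
    """Stub vision module: return the detections that match a label."""
    return [obj for obj in image if obj == name]


def generate_program(question: str) -> str:
    """Stub for the code generation model.

    A real system would prompt a code model with the module API and the
    question; here the generated source is hard-coded for one question.
    """
    programs = {
        "How many cats are there?": (
            "def answer(image):\n"
            "    return len(find_objects(image, 'cat'))\n"
        ),
    }
    return programs[question]


def run_query(image: Image, question: str) -> object:
    """Generate a program for the question and run it with the interpreter."""
    source = generate_program(question)
    # Expose only the module API to the generated program.
    namespace: Dict[str, object] = {"find_objects": find_objects}
    exec(source, namespace)  # defines answer() inside namespace
    return namespace["answer"](image)


result = run_query(["cat", "dog", "cat"], "How many cats are there?")
print(result)
```

Because the generated program is ordinary Python, the intermediate steps stay inspectable, which is the interpretability argument the description makes. Executing model-generated source with exec is unsafe outside a sandbox; the sketch ignores that concern.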
first_indexed 2024-10-01T05:12:38Z
format Final Year Project (FYP)
id ntu-10356/175074
institution Nanyang Technological University
language English
last_indexed 2024-10-01T05:12:38Z
publishDate 2024
publisher Nanyang Technological University
record_format dspace
spelling ntu-10356/175074 2024-04-19T15:46:03Z Learning deep networks for image classification Zhou, Yixuan Hanwang Zhang School of Computer Science and Engineering hanwangzhang@ntu.edu.sg Computer and Information Science Visual Question Answering stands at the intersection of computer vision and natural language processing, bridging the semantic gap between visual information and textual queries. The dominant approach to this complex task, end-to-end modeling, does not separate visual processing from reasoning, which limits both interpretability and generalization. Modular program learning is a promising alternative, but it is difficult to implement because the modules and the programs must be learned simultaneously. This project introduces VQA-GPT, a framework that uses code generation models and the Python interpreter to compose vision-and-language modules into programs that answer textual queries. This zero-shot method outperforms traditional end-to-end models in solving various complex visual tasks. Bachelor's degree 2024-04-19T04:01:42Z 2024-04-19T04:01:42Z 2024 Final Year Project (FYP) Zhou, Y. (2024). Learning deep networks for image classification. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175074 https://hdl.handle.net/10356/175074 en SCSE23-0210 application/pdf Nanyang Technological University
title Learning deep networks for image classification
topic Computer and Information Science
url https://hdl.handle.net/10356/175074