Learning deep networks for image classification
Visual Question Answering stands at the intersection of computer vision and natural language processing, bridging the semantic gap between visual information and textual queries. The dominant approach to this complex task, end-to-end modeling, does not distinguish between visual processi...
Main Author: | Zhou, Yixuan |
---|---|
Other Authors: | Hanwang Zhang |
Format: | Final Year Project (FYP) |
Language: | English |
Published: | Nanyang Technological University, 2024 |
Subjects: | Computer and Information Science |
Online Access: | https://hdl.handle.net/10356/175074 |
_version_ | 1811687206763364352 |
---|---|
author | Zhou, Yixuan |
author2 | Hanwang Zhang |
author_facet | Hanwang Zhang Zhou, Yixuan |
author_sort | Zhou, Yixuan |
collection | NTU |
description | Visual Question Answering stands at the intersection of computer vision and natural language processing, bridging the semantic gap between visual information and textual queries. The dominant approach to this complex task, end-to-end modeling, does not distinguish between visual processing and reasoning, which constrains both interpretability and generalization. Modular program learning is a promising alternative, although it is difficult to implement because the modules and the programs must be learned simultaneously. This project introduces VQA-GPT, a framework that uses code generation models and the Python interpreter to compose vision-and-language modules and produce results for textual queries. This zero-shot method outperforms traditional end-to-end models in solving various complex visual tasks. |
first_indexed | 2024-10-01T05:12:38Z |
format | Final Year Project (FYP) |
id | ntu-10356/175074 |
institution | Nanyang Technological University |
language | English |
last_indexed | 2024-10-01T05:12:38Z |
publishDate | 2024 |
publisher | Nanyang Technological University |
record_format | dspace |
spelling | ntu-10356/175074 2024-04-19T15:46:03Z Learning deep networks for image classification Zhou, Yixuan Hanwang Zhang School of Computer Science and Engineering hanwangzhang@ntu.edu.sg Computer and Information Science Visual Question Answering stands at the intersection of computer vision and natural language processing, bridging the semantic gap between visual information and textual queries. The dominant approach to this complex task, end-to-end modeling, does not distinguish between visual processing and reasoning, which constrains both interpretability and generalization. Modular program learning is a promising alternative, although it is difficult to implement because the modules and the programs must be learned simultaneously. This project introduces VQA-GPT, a framework that uses code generation models and the Python interpreter to compose vision-and-language modules and produce results for textual queries. This zero-shot method outperforms traditional end-to-end models in solving various complex visual tasks. Bachelor's degree 2024-04-19T04:01:42Z 2024-04-19T04:01:42Z 2024 Final Year Project (FYP) Zhou, Y. (2024). Learning deep networks for image classification. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/175074 https://hdl.handle.net/10356/175074 en SCSE23-0210 application/pdf Nanyang Technological University |
spellingShingle | Computer and Information Science Zhou, Yixuan Learning deep networks for image classification |
title | Learning deep networks for image classification |
topic | Computer and Information Science |
url | https://hdl.handle.net/10356/175074 |