Towards unified visual perception


Bibliographic Details
Main Author: Sun, S
Other Authors: Torr, P
Format: Thesis
Language: English
Published: 2024
Subjects: Computer vision; Deep learning (machine learning)
author Sun, S
author2 Torr, P
collection OXFORD
description This thesis explores the frontier of visual perception in computer vision, leveraging the capabilities of Vision Transformers (ViTs) to create a unified framework that addresses cross-task and cross-granularity challenges. Drawing inspiration from the human visual system's ability to process visual information at varying levels of detail, and from the success of Transformers in Natural Language Processing (NLP), we aim to bridge the gap between broad visual concepts and their fine-grained counterparts. Our investigation is structured into three parts.

First, we examine a range of training methods and architectures for ViTs, gathering insights that guide their optimization in the subsequent phase of our research and build a strong foundation for enhancing their performance on complex visual tasks.

Second, our focus shifts to the recognition of fine-grained visual concepts, employing precise annotations to probe the intricate details of visual scenes. Here we tackle the challenge of discerning and classifying objects and pixels with high accuracy, leveraging the foundational insights gained from our initial exploration of ViTs.

Finally, we demonstrate how language can serve as a bridge, enabling vision-language models that are trained only to recognize images to handle countless fine-grained visual concepts, such as objects and pixels, without fine-tuning.
first_indexed 2024-09-25T04:20:50Z
format Thesis
id oxford-uuid:a3567b92-48e1-49a1-8aad-cef7663c2b40
institution University of Oxford
language English
last_indexed 2024-09-25T04:20:50Z
publishDate 2024
record_format dspace
spelling Towards unified visual perception. Thesis (http://purl.org/coar/resource_type/c_db06). uuid:a3567b92-48e1-49a1-8aad-cef7663c2b40; record updated 2024-08-08T14:47:31Z. Subjects: Computer vision; Deep learning (machine learning). English. Hyrax Deposit, 2024. Contributors: Sun, S; Torr, P; Zisserman, A; Prisacariu, V; Cipolla, R.
title Towards unified visual perception
topic Computer vision
Deep learning (machine learning)