Zero-shot object detection and referring expression comprehension using vision-language models

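This project focused on constructing a comprehensive perception pipeline integrating Natural Language Processing (NLP), zero-shot object detection, and Referring Expression Comprehension (ReC) within a ROS (Robot Operating System) framework. The aim was to improve the ability of robotic assistive devices to accurately interpret natural language commands and ground language to physical objects in the real world. To achieve this, we compared various combinations of zero-shot object detectors and ReC models, specifically OWL-ViT and Grounding DINO for zero-shot object detection, and ReCLIP and GPT-4 for ReC. Our evaluation assessed the models' capabilities in counting, spatial reasoning, understanding superlatives, handling multiple instances, self-referential comprehension, and identifying household objects. The findings showed that GPT-4 outperformed ReCLIP for ReC, and the combination of Grounding DINO and GPT-4 proved to be the best zero-shot object detector and ReC pair.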

Bibliographic Details
Main Author:	A Manicka, Praveen
Other Authors:	Ang Wei Tech
Format:	Final Year Project (FYP)
Language:	English
Published:	Nanyang Technological University 2024
Subjects:	Computer and Information Science Engineering
Online Access:	https://hdl.handle.net/10356/177827

_version_	1826120282956365824
author	A Manicka, Praveen
author2	Ang Wei Tech
author_facet	Ang Wei Tech A Manicka, Praveen
author_sort	A Manicka, Praveen
collection	NTU
description	This project focused on constructing a comprehensive perception pipeline integrating Natural Language Processing (NLP), zero-shot object detection, and Referring Expression Comprehension (ReC) within a ROS (Robot Operating System) framework. The aim was to improve the ability of robotic assistive devices to accurately interpret natural language commands and ground language to physical objects in the real world. To achieve this, we compared various combinations of zero-shot object detectors and ReC models, specifically OWL-ViT and Grounding DINO for zero-shot object detection, and ReCLIP and GPT-4 for ReC. Our evaluation assessed the models' capabilities in counting, spatial reasoning, understanding superlatives, handling multiple instances, self-referential comprehension, and identifying household objects. The findings showed that GPT-4 outperformed ReCLIP for ReC, and the combination of Grounding DINO and GPT-4 proved to be the best zero-shot object detector and ReC pair.
first_indexed	2024-10-01T05:13:40Z
format	Final Year Project (FYP)
id	ntu-10356/177827
institution	Nanyang Technological University
language	English
last_indexed	2024-10-01T05:13:40Z
publishDate	2024
publisher	Nanyang Technological University
record_format	dspace
spelling	ntu-10356/1778272024-06-08T16:50:58Z Zero-shot object detection and referring expression comprehension using vision-language models A Manicka, Praveen Ang Wei Tech School of Mechanical and Aerospace Engineering Rehabilitation Research Institute of Singapore (RRIS) WTAng@ntu.edu.sg Computer and Information Science Engineering This project focused on constructing a comprehensive perception pipeline integrating Natural Language Processing (NLP), zero-shot object detection, and Referring Expression Comprehension (ReC) within a ROS (Robot Operating System) framework. The aim was to improve the ability of robotic assistive devices to accurately interpret natural language commands and ground language to physical objects in the real world. To achieve this, we compared various combinations of zero-shot object detectors and ReC models, specifically OWL-ViT and Grounding DINO for zero-shot object detection, and ReCLIP and GPT-4 for ReC. Our evaluation assessed the models' capabilities in counting, spatial reasoning, understanding superlatives, handling multiple instances, self-referential comprehension, and identifying household objects. The findings showed that GPT-4 outperformed ReCLIP for ReC, and the combination of Grounding DINO and GPT-4 proved to be the best zero-shot object detector and ReC pair. Bachelor's degree 2024-05-31T12:13:12Z 2024-05-31T12:13:12Z 2024 Final Year Project (FYP) A Manicka, P. (2024). Zero-shot object detection and referring expression comprehension using vision-language models. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/177827 https://hdl.handle.net/10356/177827 en application/pdf Nanyang Technological University
spellingShingle	Computer and Information Science Engineering A Manicka, Praveen Zero-shot object detection and referring expression comprehension using vision-language models
title	Zero-shot object detection and referring expression comprehension using vision-language models
title_sort	zero shot object detection and referring expression comprehension using vision language models
topic	Computer and Information Science Engineering
url	https://hdl.handle.net/10356/177827
work_keys_str_mv	AT amanickapraveen zeroshotobjectdetectionandreferringexpressioncomprehensionusingvisionlanguagemodels


Similar Items