Grounding referring expressions in images with neural module tree network

Grounding referring expressions in images, or visual grounding for short, is a task in Artificial Intelligence (AI) that locates and identifies the target object in an image described by a natural language expression. Visual grounding is a complex task that requires compositional visual reasoning to better mimic the human logical thought process. However, existing methods do not fully exploit the compositional structure of natural language, over-simplifying it into either a monolithic sentence embedding or a coarse subject-predicate-object composition. To better capture this complexity, a Neural Module Tree network (NMTree) is applied to the dependency parse tree of the referring expression during visual grounding. Each node of the dependency parse tree is treated as a neural module that computes visual attention, and the grounding scores are accumulated bottom-up to the root of the tree. A Gumbel-Softmax approximation is used to train the modules and their assembly end-to-end, mitigating parsing errors. NMTree loosely couples the compositional reasoning to the visual grounding itself, making the localization process more intuitive and interpretable. The inclusion of NMTree provides better explanations of how natural language is grounded and outperforms the state of the art on several benchmarks.
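The description above outlines the NMTree mechanism only at a high level. The toy sketch below is not the project's implementation; the class names, the two candidate module types, and the random stand-in features are assumptions made purely for illustration. It shows the general idea: each tree node scores image regions against its word, child scores are accumulated bottom-up, and a Gumbel-Softmax gate keeps the choice of module type differentiable end-to-end.

# Minimal, self-contained sketch of the bottom-up tree grounding idea.
# Illustrative only: TreeNode, NodeModule, and ground_tree are hypothetical
# names, and region/word features are random stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_WORD, D_VIS = 64, 64   # toy embedding sizes

class TreeNode:
    """One node of the dependency parse: a word embedding plus its children."""
    def __init__(self, word_emb, children=()):
        self.word_emb = word_emb          # (D_WORD,)
        self.children = list(children)

class NodeModule(nn.Module):
    """Scores image regions against one word. Two candidate module types
    ('single' vs 'relational', purely illustrative) are mixed with a
    Gumbel-Softmax gate so the assembly stays differentiable end-to-end."""
    def __init__(self):
        super().__init__()
        self.single = nn.Bilinear(D_WORD, D_VIS, 1)
        self.relational = nn.Bilinear(D_WORD, D_VIS, 1)
        self.type_logits = nn.Parameter(torch.zeros(2))

    def forward(self, word_emb, regions, child_score):
        # child_score: grounding score accumulated from subtrees, shape (N,)
        n = regions.size(0)
        w = word_emb.unsqueeze(0).expand(n, -1)
        s_single = self.single(w, regions).squeeze(-1)
        s_rel = self.relational(w, regions).squeeze(-1) + child_score
        gate = F.gumbel_softmax(self.type_logits, tau=1.0, hard=False)
        return gate[0] * s_single + gate[1] * s_rel

def ground_tree(node, regions, module):
    """Accumulate grounding scores bottom-up to the root (one shared module
    here for brevity; NMTree assigns a module to every node)."""
    child_score = torch.zeros(regions.size(0))
    for child in node.children:
        child_score = child_score + ground_tree(child, regions, module)
    return module(node.word_emb, regions, child_score)

if __name__ == "__main__":
    torch.manual_seed(0)
    regions = torch.randn(5, D_VIS)                # 5 candidate image regions
    leaf = TreeNode(torch.randn(D_WORD))           # e.g. "red"
    root = TreeNode(torch.randn(D_WORD), [leaf])   # e.g. "ball"
    scores = ground_tree(root, regions, NodeModule())
    print("region scores:", F.softmax(scores, dim=0))

In the full NMTree model each node carries its own module and the module assembly over the whole dependency tree is trained jointly with the grounding objective; the single shared module and random features above exist only to keep the sketch runnable and short.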

Bibliographic Details
Main Author: Tan, Kuan Yeow
Other Authors: Zhang Hanwang, School of Computer Science and Engineering
Format: Final Year Project (FYP)
Degree: Bachelor of Engineering (Computer Science)
Language: English
Published: Nanyang Technological University, 2022
Subjects: Engineering::Computer science and engineering::Computing methodologies::Image processing and computer vision
Online Access: https://hdl.handle.net/10356/156618