End‐to‐end visual grounding via region proposal networks and bilinear pooling

Phrase-based visual grounding aims to localise the object in an image that is referred to by a textual query phrase. Most existing approaches adopt a two-stage mechanism: first, an off-the-shelf proposal generation model extracts region-based visual features, and then a deep model scores the proposals given the query phrase and the extracted visual features. In contrast, the authors design an end-to-end approach to the visual grounding problem. They use a region proposal network to generate object proposals and the corresponding visual features simultaneously, and a multi-modal factorised bilinear pooling model to fuse the multi-modal features effectively. Two novel losses are then imposed on top of the multi-modal features to rank and refine the proposals, respectively. To verify the effectiveness of the proposed approach, the authors conduct experiments on three real-world visual grounding datasets, namely Flickr30k Entities, ReferItGame and RefCOCO. The experimental results demonstrate that the proposed method significantly outperforms the existing state of the art.
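The fusion step named in the abstract, multi-modal factorised bilinear (MFB) pooling, can be illustrated with a short PyTorch sketch. Everything below is a hedged illustration rather than the authors' implementation: the dimensions (vis_dim, txt_dim), the factor size k, the output size o, and the signed-square-root normalisation are placeholder assumptions.

# Illustrative sketch of multi-modal factorised bilinear (MFB) pooling.
# All dimensions are placeholders, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFBFusion(nn.Module):
    """Fuse visual and textual features via a low-rank bilinear interaction:
    project both inputs to a shared (k * o)-dim space, multiply element-wise,
    then sum-pool every group of k values down to an o-dim fused vector."""
    def __init__(self, vis_dim=2048, txt_dim=1024, k=5, o=1000):
        super().__init__()
        self.k, self.o = k, o
        self.proj_v = nn.Linear(vis_dim, k * o)   # visual projection
        self.proj_t = nn.Linear(txt_dim, k * o)   # textual projection

    def forward(self, v, t):
        # v: (num_proposals, vis_dim) region features; t: (txt_dim,) phrase code
        joint = self.proj_v(v) * self.proj_t(t).unsqueeze(0)   # (N, k*o)
        joint = joint.view(-1, self.o, self.k).sum(dim=2)      # sum-pool factors
        # signed square-root and L2 normalisation, common for bilinear features
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)
        return F.normalize(joint, dim=1)                       # (N, o)

# Toy usage: fuse 100 proposal features with one phrase encoding.
if __name__ == "__main__":
    fusion = MFBFusion()
    fused = fusion(torch.randn(100, 2048), torch.randn(1024))
    print(fused.shape)  # torch.Size([100, 1000])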

Bibliographic Details
Main Authors: Chenchao Xiang, Zhou Yu, Suguo Zhu, Jun Yu, Xiaokang Yang
Format: Article
Language: English
Published: Wiley, 2019-03-01
Series: IET Computer Vision
Subjects: multimodal features; real-world visual grounding datasets; end-to-end approach; phrase-based visual grounding; region proposal networks; textual query phrase
Online Access: https://doi.org/10.1049/iet-cvi.2018.5104
_version_ 1797684292694835200
author Chenchao Xiang
Zhou Yu
Suguo Zhu
Jun Yu
Xiaokang Yang
author_facet Chenchao Xiang
Zhou Yu
Suguo Zhu
Jun Yu
Xiaokang Yang
author_sort Chenchao Xiang
collection DOAJ
description Phrase-based visual grounding aims to localise the object in an image that is referred to by a textual query phrase. Most existing approaches adopt a two-stage mechanism: first, an off-the-shelf proposal generation model extracts region-based visual features, and then a deep model scores the proposals given the query phrase and the extracted visual features. In contrast, the authors design an end-to-end approach to the visual grounding problem. They use a region proposal network to generate object proposals and the corresponding visual features simultaneously, and a multi-modal factorised bilinear pooling model to fuse the multi-modal features effectively. Two novel losses are then imposed on top of the multi-modal features to rank and refine the proposals, respectively. To verify the effectiveness of the proposed approach, the authors conduct experiments on three real-world visual grounding datasets, namely Flickr30k Entities, ReferItGame and RefCOCO. The experimental results demonstrate that the proposed method significantly outperforms the existing state of the art.
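The description mentions two novel losses on top of the fused features, one to rank the proposals against the phrase and one to refine their boxes. The sketch below shows one plausible way to wire up such a pair; the specific choices here (a softmax cross-entropy over proposals for ranking, a smooth-L1 box regression for refinement, and the best-overlapping proposal as the positive) are assumptions for illustration, not necessarily the paper's exact formulation.

# Hypothetical ranking + refinement losses over fused proposal features.
# Loss choices (softmax cross-entropy, smooth-L1) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundingHead(nn.Module):
    """Maps each proposal's fused feature to a matching score and box deltas."""
    def __init__(self, fused_dim=1000):
        super().__init__()
        self.rank = nn.Linear(fused_dim, 1)    # phrase-proposal matching score
        self.refine = nn.Linear(fused_dim, 4)  # (dx, dy, dw, dh) box offsets

    def forward(self, fused):                  # fused: (num_proposals, fused_dim)
        return self.rank(fused).squeeze(-1), self.refine(fused)

def grounding_loss(scores, deltas, pos_idx, target_deltas):
    """Ranking loss treats the best-overlapping proposal (pos_idx) as the
    correct 'class' in a softmax over all proposals; the refinement loss is a
    smooth-L1 regression on that proposal's predicted box offsets."""
    rank_loss = F.cross_entropy(scores.unsqueeze(0), pos_idx.view(1))
    refine_loss = F.smooth_l1_loss(deltas[pos_idx], target_deltas)
    return rank_loss + refine_loss

# Toy usage with random tensors standing in for RPN + MFB outputs:
if __name__ == "__main__":
    fused = torch.randn(100, 1000)             # 100 proposals, fused dim 1000
    head = GroundingHead(fused_dim=1000)
    scores, deltas = head(fused)
    loss = grounding_loss(scores, deltas, torch.tensor(7), torch.zeros(4))
    print(loss.item())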
first_indexed 2024-03-12T00:28:32Z
format Article
id doaj.art-c14bde2331fd48989aa7d00f23273041
institution Directory Open Access Journal
issn 1751-9632
1751-9640
language English
last_indexed 2024-03-12T00:28:32Z
publishDate 2019-03-01
publisher Wiley
record_format Article
series IET Computer Vision
spelling doaj.art-c14bde2331fd48989aa7d00f23273041 2023-09-15T10:31:50Z
eng; Wiley; IET Computer Vision; ISSN 1751-9632, 1751-9640; 2019-03-01; vol. 13, iss. 2, pp. 131-138; doi 10.1049/iet-cvi.2018.5104
End-to-end visual grounding via region proposal networks and bilinear pooling
Chenchao Xiang, Zhou Yu, Suguo Zhu, Jun Yu: Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, People's Republic of China
Xiaokang Yang: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, People's Republic of China
Phrase-based visual grounding aims to localise the object in an image that is referred to by a textual query phrase. Most existing approaches adopt a two-stage mechanism: first, an off-the-shelf proposal generation model extracts region-based visual features, and then a deep model scores the proposals given the query phrase and the extracted visual features. In contrast, the authors design an end-to-end approach to the visual grounding problem. They use a region proposal network to generate object proposals and the corresponding visual features simultaneously, and a multi-modal factorised bilinear pooling model to fuse the multi-modal features effectively. Two novel losses are then imposed on top of the multi-modal features to rank and refine the proposals, respectively. To verify the effectiveness of the proposed approach, the authors conduct experiments on three real-world visual grounding datasets, namely Flickr30k Entities, ReferItGame and RefCOCO. The experimental results demonstrate that the proposed method significantly outperforms the existing state of the art.
https://doi.org/10.1049/iet-cvi.2018.5104
Keywords: multimodal features; real-world visual grounding datasets; end-to-end approach; phrase-based visual grounding; region proposal networks; textual query phrase
spellingShingle Chenchao Xiang
Zhou Yu
Suguo Zhu
Jun Yu
Xiaokang Yang
End‐to‐end visual grounding via region proposal networks and bilinear pooling
IET Computer Vision
multimodal features
real-world visual grounding datasets
end-to-end approach
phrase-based visual grounding
region proposal networks
textual query phrase
title End‐to‐end visual grounding via region proposal networks and bilinear pooling
title_full End‐to‐end visual grounding via region proposal networks and bilinear pooling
title_fullStr End‐to‐end visual grounding via region proposal networks and bilinear pooling
title_full_unstemmed End‐to‐end visual grounding via region proposal networks and bilinear pooling
title_short End‐to‐end visual grounding via region proposal networks and bilinear pooling
title_sort end to end visual grounding via region proposal networks and bilinear pooling
topic multimodal features
real-world visual grounding datasets
end-to-end approach
phrase-based visual grounding
region proposal networks
textual query phrase
url https://doi.org/10.1049/iet-cvi.2018.5104
work_keys_str_mv AT chenchaoxiang endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling
AT zhouyu endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling
AT suguozhu endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling
AT junyu endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling
AT xiaokangyang endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling