End‐to‐end visual grounding via region proposal networks and bilinear pooling

Phrase-based visual grounding aims to localise the object in an image that is referred to by a textual query phrase. Most existing approaches adopt a two-stage mechanism: first, an off-the-shelf proposal generation model extracts region-based visual features, and then a deep model scores the proposals given the query phrase and the extracted visual features. In contrast, the authors design an end-to-end approach to the visual grounding problem. They use a region proposal network to generate object proposals and the corresponding visual features simultaneously, and a multi-modal factorised bilinear pooling model to fuse the multi-modal features effectively. Two novel losses are then imposed on top of the multi-modal features to rank and refine the proposals, respectively. To verify the effectiveness of the proposed approach, the authors conduct experiments on three real-world visual grounding datasets, namely Flickr30k Entities, ReferItGame and RefCOCO. The experimental results demonstrate that the proposed method significantly outperforms the existing state of the art.
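The fusion step named in the abstract, multi-modal factorised bilinear (MFB) pooling, can be illustrated with a short PyTorch sketch. Everything below is a hedged illustration rather than the authors' implementation: the dimensions (vis_dim, txt_dim), the factor size k, the output size o, and the signed-square-root normalisation are placeholder assumptions.

# Illustrative sketch of multi-modal factorised bilinear (MFB) pooling.
# All dimensions are placeholders, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFBFusion(nn.Module):
    """Fuse visual and textual features via a low-rank bilinear interaction:
    project both inputs to a shared (k * o)-dim space, multiply element-wise,
    then sum-pool every group of k values down to an o-dim fused vector."""
    def __init__(self, vis_dim=2048, txt_dim=1024, k=5, o=1000):
        super().__init__()
        self.k, self.o = k, o
        self.proj_v = nn.Linear(vis_dim, k * o)   # visual projection
        self.proj_t = nn.Linear(txt_dim, k * o)   # textual projection

    def forward(self, v, t):
        # v: (num_proposals, vis_dim) region features; t: (txt_dim,) phrase code
        joint = self.proj_v(v) * self.proj_t(t).unsqueeze(0)   # (N, k*o)
        joint = joint.view(-1, self.o, self.k).sum(dim=2)      # sum-pool factors
        # signed square-root and L2 normalisation, common for bilinear features
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)
        return F.normalize(joint, dim=1)                       # (N, o)

# Toy usage: fuse 100 proposal features with one phrase encoding.
if __name__ == "__main__":
    fusion = MFBFusion()
    fused = fusion(torch.randn(100, 2048), torch.randn(1024))
    print(fused.shape)  # torch.Size([100, 1000])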

Bibliographic Details
Main Authors: Chenchao Xiang, Zhou Yu, Suguo Zhu, Jun Yu, Xiaokang Yang
Format: Article
Language: English
Published: Wiley, 2019-03-01
Series: IET Computer Vision
Subjects: multimodal features; real-world visual grounding datasets; end-to-end approach; phrase-based visual grounding; region proposal networks; textual query phrase
Online Access: https://doi.org/10.1049/iet-cvi.2018.5104
_version_ 1797684292694835200
author Chenchao Xiang
Zhou Yu
Suguo Zhu
Jun Yu
Xiaokang Yang
author_facet Chenchao Xiang
Zhou Yu
Suguo Zhu
Jun Yu
Xiaokang Yang
author_sort Chenchao Xiang
collection DOAJ
description Phrase-based visual grounding aims to localise the object in an image that is referred to by a textual query phrase. Most existing approaches adopt a two-stage mechanism: first, an off-the-shelf proposal generation model extracts region-based visual features, and then a deep model scores the proposals given the query phrase and the extracted visual features. In contrast, the authors design an end-to-end approach to the visual grounding problem. They use a region proposal network to generate object proposals and the corresponding visual features simultaneously, and a multi-modal factorised bilinear pooling model to fuse the multi-modal features effectively. Two novel losses are then imposed on top of the multi-modal features to rank and refine the proposals, respectively. To verify the effectiveness of the proposed approach, the authors conduct experiments on three real-world visual grounding datasets, namely Flickr30k Entities, ReferItGame and RefCOCO. The experimental results demonstrate that the proposed method significantly outperforms the existing state of the art.
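The description mentions two novel losses on top of the fused features, one to rank the proposals against the phrase and one to refine their boxes. The sketch below shows one plausible way to wire up such a pair; the specific choices here (a softmax cross-entropy over proposals for ranking, a smooth-L1 box regression for refinement, and the best-overlapping proposal as the positive) are assumptions for illustration, not necessarily the paper's exact formulation.

# Hypothetical ranking + refinement losses over fused proposal features.
# Loss choices (softmax cross-entropy, smooth-L1) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundingHead(nn.Module):
    """Maps each proposal's fused feature to a matching score and box deltas."""
    def __init__(self, fused_dim=1000):
        super().__init__()
        self.rank = nn.Linear(fused_dim, 1)    # phrase-proposal matching score
        self.refine = nn.Linear(fused_dim, 4)  # (dx, dy, dw, dh) box offsets

    def forward(self, fused):                  # fused: (num_proposals, fused_dim)
        return self.rank(fused).squeeze(-1), self.refine(fused)

def grounding_loss(scores, deltas, pos_idx, target_deltas):
    """Ranking loss treats the best-overlapping proposal (pos_idx) as the
    correct 'class' in a softmax over all proposals; the refinement loss is a
    smooth-L1 regression on that proposal's predicted box offsets."""
    rank_loss = F.cross_entropy(scores.unsqueeze(0), pos_idx.view(1))
    refine_loss = F.smooth_l1_loss(deltas[pos_idx], target_deltas)
    return rank_loss + refine_loss

# Toy usage with random tensors standing in for RPN + MFB outputs:
if __name__ == "__main__":
    fused = torch.randn(100, 1000)             # 100 proposals, fused dim 1000
    head = GroundingHead(fused_dim=1000)
    scores, deltas = head(fused)
    loss = grounding_loss(scores, deltas, torch.tensor(7), torch.zeros(4))
    print(loss.item())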
first_indexed 2024-03-12T00:28:32Z
format Article
id doaj.art-c14bde2331fd48989aa7d00f23273041
institution Directory Open Access Journal
issn 1751-9632
1751-9640
language English
last_indexed 2024-03-12T00:28:32Z
publishDate 2019-03-01
publisher Wiley
record_format Article
series IET Computer Vision
spelling doaj.art-c14bde2331fd48989aa7d00f23273041 2023-09-15T10:31:50Z
eng; Wiley; IET Computer Vision; ISSN 1751-9632, 1751-9640; 2019-03-01; vol. 13, iss. 2, pp. 131-138; doi 10.1049/iet-cvi.2018.5104
End-to-end visual grounding via region proposal networks and bilinear pooling
Chenchao Xiang, Zhou Yu, Suguo Zhu, Jun Yu: Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, People's Republic of China
Xiaokang Yang: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, People's Republic of China
Phrase-based visual grounding aims to localise the object in an image that is referred to by a textual query phrase. Most existing approaches adopt a two-stage mechanism: first, an off-the-shelf proposal generation model extracts region-based visual features, and then a deep model scores the proposals given the query phrase and the extracted visual features. In contrast, the authors design an end-to-end approach to the visual grounding problem. They use a region proposal network to generate object proposals and the corresponding visual features simultaneously, and a multi-modal factorised bilinear pooling model to fuse the multi-modal features effectively. Two novel losses are then imposed on top of the multi-modal features to rank and refine the proposals, respectively. To verify the effectiveness of the proposed approach, the authors conduct experiments on three real-world visual grounding datasets, namely Flickr30k Entities, ReferItGame and RefCOCO. The experimental results demonstrate that the proposed method significantly outperforms the existing state of the art.
https://doi.org/10.1049/iet-cvi.2018.5104
Keywords: multimodal features; real-world visual grounding datasets; end-to-end approach; phrase-based visual grounding; region proposal networks; textual query phrase
spellingShingle Chenchao Xiang
Zhou Yu
Suguo Zhu
Jun Yu
Xiaokang Yang
End‐to‐end visual grounding via region proposal networks and bilinear pooling
IET Computer Vision
multimodal features
real-world visual grounding datasets
end-to-end approach
phrase-based visual grounding
region proposal networks
textual query phrase
title End‐to‐end visual grounding via region proposal networks and bilinear pooling
title_full End‐to‐end visual grounding via region proposal networks and bilinear pooling
title_fullStr End‐to‐end visual grounding via region proposal networks and bilinear pooling
title_full_unstemmed End‐to‐end visual grounding via region proposal networks and bilinear pooling
title_short End‐to‐end visual grounding via region proposal networks and bilinear pooling
title_sort end to end visual grounding via region proposal networks and bilinear pooling
topic multimodal features
real-world visual grounding datasets
end-to-end approach
phrase-based visual grounding
region proposal networks
textual query phrase
url https://doi.org/10.1049/iet-cvi.2018.5104
work_keys_str_mv AT chenchaoxiang endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling
AT zhouyu endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling
AT suguozhu endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling
AT junyu endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling
AT xiaokangyang endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling