End‐to‐end visual grounding via region proposal networks and bilinear pooling
Phrase‐based visual grounding aims to localise the object in an image referred to by a textual query phrase. Most existing approaches adopt a two‐stage mechanism: first, an off‐the‐shelf proposal generation model extracts region‐based visual features, and then a deep model scores the proposals against the query phrase and the extracted visual features. In contrast, the authors design an end‐to‐end approach: a region proposal network generates object proposals and the corresponding visual features simultaneously, and a multi‐modal factorised bilinear pooling model fuses the multi‐modal features effectively. Two novel losses are then imposed on top of the fused features to rank and refine the proposals, respectively. Experiments on three real‐world visual grounding datasets (Flickr‐30k Entities, ReferItGame and RefCOCO) demonstrate significant superiority over existing state‐of‐the‐art methods.
Main Authors: | Chenchao Xiang, Zhou Yu, Suguo Zhu, Jun Yu, Xiaokang Yang |
---|---|
Format: | Article |
Language: | English |
Published: | Wiley, 2019-03-01 |
Series: | IET Computer Vision |
Subjects: | multimodal features; real-world visual grounding datasets; end-to-end approach; phrase-based visual grounding; region proposal networks; textual query phrase |
Online Access: | https://doi.org/10.1049/iet-cvi.2018.5104 |
_version_ | 1797684292694835200 |
---|---|
author | Chenchao Xiang; Zhou Yu; Suguo Zhu; Jun Yu; Xiaokang Yang |
author_facet | Chenchao Xiang; Zhou Yu; Suguo Zhu; Jun Yu; Xiaokang Yang |
author_sort | Chenchao Xiang |
collection | DOAJ |
description | Phrase‐based visual grounding aims to localise the object in an image referred to by a textual query phrase. Most existing approaches adopt a two‐stage mechanism: first, an off‐the‐shelf proposal generation model extracts region‐based visual features, and then a deep model scores the proposals against the query phrase and the extracted visual features. In contrast, the authors design an end‐to‐end approach to the visual grounding problem in this study. They use a region proposal network to generate object proposals and the corresponding visual features simultaneously, and a multi‐modal factorised bilinear pooling model to fuse the multi‐modal features effectively. Two novel losses are then imposed on top of the fused features to rank and refine the proposals, respectively. To verify the effectiveness of the proposed approach, the authors conduct experiments on three real‐world visual grounding datasets: Flickr‐30k Entities, ReferItGame and RefCOCO. The results demonstrate significant superiority of the proposed method over existing state‐of‐the‐art methods. |
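The multi-modal factorised bilinear pooling step described in the abstract can be sketched as follows. This is a minimal NumPy illustration of generic MFB-style fusion, not the authors' actual implementation: the projection matrices `U` and `V`, the pooling window size `k`, and all dimensions are illustrative assumptions.

```python
import numpy as np

def mfb_fuse(img_feat, txt_feat, U, V, k=5):
    """Fuse an image feature and a text feature via factorised bilinear pooling."""
    # Project both modalities into a shared (d*k)-dim space and multiply element-wise
    proj = (U.T @ img_feat) * (V.T @ txt_feat)
    # Sum-pool over non-overlapping windows of size k -> d-dim fused vector
    d = proj.shape[0] // k
    z = proj.reshape(d, k).sum(axis=1)
    # Signed square-root (power) normalisation, then L2 normalisation
    z = np.sign(z) * np.sqrt(np.abs(z))
    return z / (np.linalg.norm(z) + 1e-12)

# Illustrative dimensions: 2048-dim region feature, 1024-dim phrase feature
rng = np.random.default_rng(0)
m, n, d, k = 2048, 1024, 1000, 5
U = rng.standard_normal((m, d * k))
V = rng.standard_normal((n, d * k))
fused = mfb_fuse(rng.standard_normal(m), rng.standard_normal(n), U, V, k)
print(fused.shape)  # (1000,)
```

In the paper's setting, a vector like `fused` would be computed for each region proposal and fed to the ranking and refinement losses; the low-rank factorisation (two thin projections plus sum-pooling) keeps the parameter count far below that of a full bilinear outer product.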
first_indexed | 2024-03-12T00:28:32Z |
format | Article |
id | doaj.art-c14bde2331fd48989aa7d00f23273041 |
institution | Directory Open Access Journal |
issn | 1751-9632 1751-9640 |
language | English |
last_indexed | 2024-03-12T00:28:32Z |
publishDate | 2019-03-01 |
publisher | Wiley |
record_format | Article |
series | IET Computer Vision |
spelling | doaj.art-c14bde2331fd48989aa7d00f23273041, indexed 2023-09-15T10:31:50Z. English; Wiley; IET Computer Vision; ISSN 1751-9632, 1751-9640; published 2019-03-01; vol. 13, no. 2, pp. 131–138; doi: 10.1049/iet-cvi.2018.5104. End‐to‐end visual grounding via region proposal networks and bilinear pooling. Chenchao Xiang, Zhou Yu, Suguo Zhu and Jun Yu (Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, People's Republic of China); Xiaokang Yang (School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, People's Republic of China). https://doi.org/10.1049/iet-cvi.2018.5104. Subjects: multimodal features; real-world visual grounding datasets; end-to-end approach; phrase-based visual grounding; region proposal networks; textual query phrase |
spellingShingle | Chenchao Xiang; Zhou Yu; Suguo Zhu; Jun Yu; Xiaokang Yang; End‐to‐end visual grounding via region proposal networks and bilinear pooling; IET Computer Vision; multimodal features; real-world visual grounding datasets; end-to-end approach; phrase-based visual grounding; region proposal networks; textual query phrase |
title | End‐to‐end visual grounding via region proposal networks and bilinear pooling |
title_full | End‐to‐end visual grounding via region proposal networks and bilinear pooling |
title_fullStr | End‐to‐end visual grounding via region proposal networks and bilinear pooling |
title_full_unstemmed | End‐to‐end visual grounding via region proposal networks and bilinear pooling |
title_short | End‐to‐end visual grounding via region proposal networks and bilinear pooling |
title_sort | end to end visual grounding via region proposal networks and bilinear pooling |
topic | multimodal features; real-world visual grounding datasets; end-to-end approach; phrase-based visual grounding; region proposal networks; textual query phrase |
url | https://doi.org/10.1049/iet-cvi.2018.5104 |
work_keys_str_mv | AT chenchaoxiang endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling AT zhouyu endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling AT suguozhu endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling AT junyu endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling AT xiaokangyang endtoendvisualgroundingviaregionproposalnetworksandbilinearpooling |