Cross-Lingual Visual Grounding
Visual grounding is a vision and language understanding task that aims to locate a region in an image according to a specific query phrase. However, most previous studies address this task only for English. Although there are previous cross-lingual vision and language studies, they work on image and video captioning and visual question answering. In this paper, we present the first work on cross-lingual visual grounding, expanding the task to other languages and studying an effective yet efficient way to perform visual grounding in them. We construct a visual grounding dataset for French via crowdsourcing. Our dataset consists of 14k, 3k, and 3k query phrases with their corresponding image regions for 5k, 1k, and 1k training, validation, and test images, respectively. In addition, we propose a cross-lingual visual grounding approach that transfers knowledge from a learnt English model to a French model. Although our French dataset is only 1/6 the size of the English dataset, experiments indicate that our model achieves an accuracy of 65.17%, which is comparable to the 69.04% accuracy of the English model. Our dataset and code are available at https://github.com/ids-cv/Multi-Lingual-Visual-Grounding.
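The record does not describe the model architecture or training procedure beyond the idea of transferring knowledge from a trained English grounding model to a French one. The sketch below is therefore only an illustration of that transfer idea, not the authors' method: all class names, dimensions, and the weight-copying recipe are assumptions. It encodes a query phrase with a recurrent text encoder, scores it against pre-extracted region features, and initialises a French model from an English checkpoint (skipping the vocabulary-specific embeddings) before fine-tuning on the smaller French dataset.

```python
# Illustrative sketch only: the paper's actual architecture and training
# procedure are not given in this record. Module names, dimensions, and the
# transfer recipe below are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class PhraseGroundingModel(nn.Module):
    """Scores candidate image regions against an encoded query phrase."""

    def __init__(self, vocab_size: int, text_dim: int = 256, region_dim: int = 2048):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, text_dim)
        self.text_encoder = nn.GRU(text_dim, text_dim, batch_first=True)
        self.region_proj = nn.Linear(region_dim, text_dim)

    def forward(self, phrase_ids: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # phrase_ids: (batch, seq_len); region_feats: (batch, num_regions, region_dim)
        _, h = self.text_encoder(self.embedding(phrase_ids))  # h: (1, batch, text_dim)
        query = h.squeeze(0).unsqueeze(1)                     # (batch, 1, text_dim)
        regions = self.region_proj(region_feats)              # (batch, num_regions, text_dim)
        # Dot-product score of each candidate region against the query phrase.
        return (regions * query).sum(dim=-1)                  # (batch, num_regions)


def transfer_english_to_french(english_ckpt: str, french_vocab_size: int) -> PhraseGroundingModel:
    """Hypothetical transfer recipe: copy all shape-compatible weights from a
    trained English checkpoint into a fresh French model, skipping the
    vocabulary-specific word embeddings, then fine-tune on the French data."""
    french_model = PhraseGroundingModel(vocab_size=french_vocab_size)
    french_state = french_model.state_dict()
    english_state = torch.load(english_ckpt, map_location="cpu")  # assumed to be a plain state_dict
    transferable = {
        k: v for k, v in english_state.items()
        if k in french_state and v.shape == french_state[k].shape
    }
    french_model.load_state_dict(transferable, strict=False)
    return french_model


# Example usage (hypothetical checkpoint path and vocabulary size):
# model = transfer_english_to_french("english_grounding.pt", french_vocab_size=20000)
```

Under these assumptions, fine-tuning would then proceed on the 14k French training query phrases described in the abstract.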
Main Authors: | Wenjian Dong, Mayu Otani, Noa Garcia, Yuta Nakashima, Chenhui Chu |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2021-01-01 |
Series: | IEEE Access |
Subjects: | Visual grounding; cross-lingual; vision and language |
Online Access: | https://ieeexplore.ieee.org/document/9305199/ |
_version_ | 1818448927604604928 |
---|---|
author | Wenjian Dong; Mayu Otani; Noa Garcia; Yuta Nakashima; Chenhui Chu |
author_facet | Wenjian Dong; Mayu Otani; Noa Garcia; Yuta Nakashima; Chenhui Chu |
author_sort | Wenjian Dong |
collection | DOAJ |
description | Visual grounding is a vision and language understanding task that aims to locate a region in an image according to a specific query phrase. However, most previous studies address this task only for English. Although there are previous cross-lingual vision and language studies, they work on image and video captioning and visual question answering. In this paper, we present the first work on cross-lingual visual grounding, expanding the task to other languages and studying an effective yet efficient way to perform visual grounding in them. We construct a visual grounding dataset for French via crowdsourcing. Our dataset consists of 14k, 3k, and 3k query phrases with their corresponding image regions for 5k, 1k, and 1k training, validation, and test images, respectively. In addition, we propose a cross-lingual visual grounding approach that transfers knowledge from a learnt English model to a French model. Although our French dataset is only 1/6 the size of the English dataset, experiments indicate that our model achieves an accuracy of 65.17%, which is comparable to the 69.04% accuracy of the English model. Our dataset and code are available at https://github.com/ids-cv/Multi-Lingual-Visual-Grounding. |
first_indexed | 2024-12-14T20:27:17Z |
format | Article |
id | doaj.art-83a005f4ea804919a06c27bf6ede3169 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-14T20:27:17Z |
publishDate | 2021-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-83a005f4ea804919a06c27bf6ede3169; 2022-12-21T22:48:36Z; eng; IEEE; IEEE Access; 2169-3536; 2021-01-01; vol. 9, pp. 349-358; 10.1109/ACCESS.2020.3046719; 9305199; Cross-Lingual Visual Grounding; Wenjian Dong (École Polytechnique, Palaiseau, France); Mayu Otani (CyberAgent, Inc., Shibuya, Japan); Noa Garcia (Institute for Datability Science, Osaka University, Suita, Japan); Yuta Nakashima (Institute for Datability Science, Osaka University, Suita, Japan; https://orcid.org/0000-0001-8000-3567); Chenhui Chu (Graduate School of Informatics, Kyoto University, Kyoto, Japan; https://orcid.org/0000-0001-9848-6384); Visual grounding is a vision and language understanding task that aims to locate a region in an image according to a specific query phrase. However, most previous studies address this task only for English. Although there are previous cross-lingual vision and language studies, they work on image and video captioning and visual question answering. In this paper, we present the first work on cross-lingual visual grounding, expanding the task to other languages and studying an effective yet efficient way to perform visual grounding in them. We construct a visual grounding dataset for French via crowdsourcing. Our dataset consists of 14k, 3k, and 3k query phrases with their corresponding image regions for 5k, 1k, and 1k training, validation, and test images, respectively. In addition, we propose a cross-lingual visual grounding approach that transfers knowledge from a learnt English model to a French model. Although our French dataset is only 1/6 the size of the English dataset, experiments indicate that our model achieves an accuracy of 65.17%, which is comparable to the 69.04% accuracy of the English model. Our dataset and code are available at https://github.com/ids-cv/Multi-Lingual-Visual-Grounding.; https://ieeexplore.ieee.org/document/9305199/; Visual grounding; cross-lingual; vision and language |
spellingShingle | Wenjian Dong; Mayu Otani; Noa Garcia; Yuta Nakashima; Chenhui Chu; Cross-Lingual Visual Grounding; IEEE Access; Visual grounding; cross-lingual; vision and language |
title | Cross-Lingual Visual Grounding |
title_full | Cross-Lingual Visual Grounding |
title_fullStr | Cross-Lingual Visual Grounding |
title_full_unstemmed | Cross-Lingual Visual Grounding |
title_short | Cross-Lingual Visual Grounding |
title_sort | cross lingual visual grounding |
topic | Visual grounding; cross-lingual; vision and language |
url | https://ieeexplore.ieee.org/document/9305199/ |
work_keys_str_mv | AT wenjiandong crosslingualvisualgrounding AT mayuotani crosslingualvisualgrounding AT noagarcia crosslingualvisualgrounding AT yutanakashima crosslingualvisualgrounding AT chenhuichu crosslingualvisualgrounding |