Pre-Trained Word Embedding and Language Model Improve Multimodal Machine Translation: A Case Study in Multi30K

Multimodal machine translation (MMT) is an attractive application of neural machine translation (NMT) that commonly incorporates image information. However, the MMT models proposed thus far perform only comparably to, or slightly better than, their text-only counterparts. One potential cause of this shortfall is a lack of large-scale data. Most previous studies mitigate this limitation by employing, in various ways, large-scale textual parallel corpora, which are more accessible than multimodal parallel corpora. However, even these corpora are available only on a limited scale for low-resource language pairs or domains. In this study, we leveraged monolingual (or multimodal monolingual) corpora, which are available at scale in most languages and domains, to improve MMT models. Our approach follows previous unimodal work that uses monolingual corpora to train word embeddings or a language model and incorporates them into NMT systems. While these methods demonstrate the advantage of pre-trained representations, there is still room for improvement in MMT models. To this end, our system employs a debiasing procedure for the word embeddings and a multimodal extension of the language model (a visual-language model, VLM) to make better use of the pre-trained knowledge in the MMT task. Evaluations conducted on the de facto MMT dataset (Multi30K) for English–German translation show improvements of approximately +1.84 BLEU from well-tailored word embeddings and +1.63 BLEU from the VLM. An evaluation on multiple language pairs shows that both techniques carry over across languages. Beyond these gains, we conducted an extensive analysis of VLM manipulation and identified promising directions for developing better MMT models that exploit a VLM: some benefits brought by each modality are still missing, and MMT with the VLM generates less fluent translations. Our code is available at https://github.com/toshohirasawa/mmt-with-monolingual-data.

Bibliographic Details
Main Authors: Tosho Hirasawa, Masahiro Kaneko, Aizhan Imankulova, Mamoru Komachi
Affiliation: Graduate School of System Design, Tokyo Metropolitan University, Hino, Tokyo, Japan
Format: Article
Language: English
Published: IEEE 2022-01-01
Series: IEEE Access, Vol. 10, pp. 67653–67668
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3185243
Subjects: Multimodal machine translation, natural language processing, neural machine translation
Online Access: https://ieeexplore.ieee.org/document/9803016/
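
The following is a minimal, illustrative Python sketch of the general recipe the abstract describes: pre-train word embeddings on monolingual data, debias them, and use them to initialize the embedding layer of a translation model. It is not the authors' released implementation (see the GitHub repository linked above); the toy vocabulary, embedding dimension, stand-in random vectors, and the all-but-the-top-style debiasing with two removed principal components are illustrative assumptions.

# Illustrative sketch only (not the authors' released code): initialize an
# NMT/MMT embedding layer from pre-trained word embeddings after a simple
# "all-but-the-top"-style debiasing step (mean-centering plus removal of the
# top principal components). Vocabulary, dimensions, and vectors are toy
# placeholders; real vectors would come from word2vec/GloVe/fastText trained
# on monolingual data.
import numpy as np
import torch
import torch.nn as nn

def debias(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Mean-center the embeddings and remove their dominant principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # principal directions
    top = vt[:n_components]                                   # (n_components, dim)
    # Subtract each vector's projection onto the dominant directions.
    return centered - centered @ top.T @ top

# Toy stand-in for pre-trained vectors: 5 words, 8-dimensional embeddings.
rng = np.random.default_rng(0)
vocab = ["<pad>", "<unk>", "a", "dog", "runs"]
pretrained = rng.normal(size=(len(vocab), 8)).astype(np.float32)

debiased = debias(pretrained, n_components=2)

# Copy the debiased vectors into the embedding layer of a translation model;
# freeze=False lets MMT training fine-tune them further.
embedding_layer = nn.Embedding.from_pretrained(
    torch.from_numpy(debiased), freeze=False, padding_idx=vocab.index("<pad>")
)
print(embedding_layer(torch.tensor([vocab.index("dog")])).shape)  # torch.Size([1, 8])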