Chinese–Vietnamese Pseudo-Parallel Sentences Extraction Based on Image Information Fusion

Parallel sentences play a crucial role in various NLP tasks, particularly for cross-lingual tasks such as machine translation. However, due to the time-consuming and laborious nature of manual construction, many low-resource languages still suffer from a lack of large-scale parallel data. The object...

Bibliographic Details
Main Authors: Yonghua Wen, Junjun Guo, Zhiqiang Yu, Zhengtao Yu
Format: Article
Language: English
Published: MDPI AG 2023-05-01
Series: Information
Subjects: neural machine translation; pseudo-parallel sentence extraction; image information fusion
Online Access: https://www.mdpi.com/2078-2489/14/5/298
author Yonghua Wen
Junjun Guo
Zhiqiang Yu
Zhengtao Yu
collection DOAJ
description Parallel sentences play a crucial role in various NLP tasks, particularly for cross-lingual tasks such as machine translation. However, due to the time-consuming and laborious nature of manual construction, many low-resource languages still suffer from a lack of large-scale parallel data. The objective of pseudo-parallel sentence extraction is to automatically identify sentence pairs in different languages that convey similar meanings. Earlier methods relied heavily on parallel data, which makes them unsuitable for low-resource scenarios. The current mainstream research direction is to use transfer learning or unsupervised learning based on cross-lingual word embeddings and multilingual pre-trained models; however, these methods are ineffective for languages with substantial differences. To address this issue, we propose a sentence extraction method that leverages image information fusion to extract Chinese–Vietnamese pseudo-parallel sentences from collections of bilingual texts. Our method first employs an adaptive image and text feature fusion strategy to efficiently extract bilingual parallel sentence pairs, and then applies a multimodal fusion method to balance the information between the image and text modalities. Experiments on multiple benchmarks show that, by infusing additional external image information, our method achieves promising results compared to a competitive baseline.
first_indexed 2024-03-11T03:38:57Z
format Article
id doaj.art-c925931d160b4024acbc8c406b9a5809
institution Directory of Open Access Journals
issn 2078-2489
language English
last_indexed 2024-03-11T03:38:57Z
publishDate 2023-05-01
publisher MDPI AG
record_format Article
series Information
spelling doaj.art-c925931d160b4024acbc8c406b9a5809
2023-11-18T01:48:18Z
eng
MDPI AG
Information
2078-2489
2023-05-01
Volume 14, Issue 5, Article 298
10.3390/info14050298
Chinese–Vietnamese Pseudo-Parallel Sentences Extraction Based on Image Information Fusion
Yonghua Wen
Junjun Guo
Zhiqiang Yu
Zhengtao Yu
Faculty of Information Engineering and Automation, Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China (affiliation of all four authors)
https://www.mdpi.com/2078-2489/14/5/298
neural machine translation
pseudo-parallel sentence extraction
image information fusion
title Chinese–Vietnamese Pseudo-Parallel Sentences Extraction Based on Image Information Fusion
topic neural machine translation
pseudo-parallel sentence extraction
image information fusion
url https://www.mdpi.com/2078-2489/14/5/298