Factors Behind the Effectiveness of an Unsupervised Neural Machine Translation System between Korean and Japanese

Korean and Japanese have different writing scripts but share the same Subject-Object-Verb (SOV) word order. In this study, we pre-train a language-generation model with the Masked Sequence-to-Sequence pre-training (MASS) method on Korean and Japanese monolingual corpora, allowing only a minimal shared vocabulary between the two languages. We then build an unsupervised Neural Machine Translation (NMT) system between Korean and Japanese on top of the pre-trained generation model. Despite the different writing scripts and the minimal shared vocabulary, the unsupervised NMT system performs well compared to systems for other language pairs. Our interest lies in the common characteristics of the two languages that make unsupervised NMT work so well. We propose a new method that analyzes the cross-attention between a source and a target language to estimate their linguistic difference from the perspective of machine translation. We compute cross-attention measurements for the Korean–Japanese and Korean–English pairs and compare their performance and characteristics. Korean and Japanese differ little in word order and morphology, and thus an unsupervised NMT system between them can be trained well even without parallel sentences or a shared vocabulary.
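The MASS objective used for pre-training masks a contiguous span of the input sentence and trains the decoder to regenerate exactly that span. The sketch below is a minimal illustration of how such a training pair can be constructed; the function name and the fixed 50% mask ratio are assumptions for the example, not details taken from the paper.

```python
import random

MASK = "<mask>"

def mass_training_pair(tokens, mask_ratio=0.5):
    """Build one MASS-style training example: the encoder sees the sentence
    with a contiguous span replaced by <mask> tokens, and the decoder must
    regenerate exactly that masked span. Illustrative only."""
    n = len(tokens)
    span_len = max(1, int(n * mask_ratio))
    start = random.randint(0, n - span_len)
    encoder_input = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    decoder_target = tokens[start:start + span_len]
    # Teacher forcing: the decoder input is the target shifted right by one.
    decoder_input = [MASK] + decoder_target[:-1]
    return encoder_input, decoder_input, decoder_target

enc, dec_in, dec_out = mass_training_pair("the model learns to fill in spans".split())
```

In the paper's setting, such pairs are drawn from both the Korean and the Japanese monolingual corpora, so a single encoder-decoder learns to generate in both languages before any translation training takes place.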
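The abstract does not spell out how the unsupervised NMT system is trained on top of the pre-trained model; iterative back-translation is the standard recipe in this line of work, and the sketch below shows only its data flow under that assumption. Here `translate` and `train_step` are caller-supplied placeholders, not the authors' API.

```python
def back_translation_epoch(ko_corpus, ja_corpus, translate, train_step):
    """One pass of iterative back-translation over two monolingual corpora:
    each monolingual sentence becomes a gold target for a synthetic source
    produced by the current model. A sketch, not the authors' code."""
    for ko, ja in zip(ko_corpus, ja_corpus):
        # Synthesize pseudo-parallel pairs with the current model...
        ja_pseudo = translate(ko, src="ko", tgt="ja")
        ko_pseudo = translate(ja, src="ja", tgt="ko")
        # ...then train in the reverse direction on the synthetic pairs.
        train_step(src=ja_pseudo, tgt=ko)
        train_step(src=ko_pseudo, tgt=ja)

# Dummy usage with an identity "translator" and a no-op update, to show the flow:
back_translation_epoch(
    ["안녕하세요"], ["こんにちは"],
    translate=lambda s, src, tgt: s,
    train_step=lambda src, tgt: None,
)
```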
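The paper's concrete cross-attention measurements are its own contribution and are not reproduced here. As one rough illustration of the underlying idea, the sketch below scores how close a cross-attention matrix is to a monotonic diagonal alignment; under the paper's hypothesis, a Korean–Japanese model should score higher than a Korean–English one, since the two languages share SOV word order. The metric itself is a hypothetical stand-in.

```python
import numpy as np

def attention_diagonality(attn):
    """Score in [0, 1] for how diagonal a cross-attention matrix is.
    attn: (target_len, source_len) array whose rows sum to 1.
    For each target position, take the expected source position under the
    attention distribution and measure its deviation from the ideal diagonal."""
    t_len, s_len = attn.shape
    src_pos = np.arange(s_len) / max(s_len - 1, 1)    # normalized positions 0..1
    tgt_pos = np.arange(t_len) / max(t_len - 1, 1)
    expected_src = attn @ src_pos                     # expected source position per target token
    deviation = np.abs(expected_src - tgt_pos).mean() # 0 = perfectly diagonal
    return 1.0 - deviation

# Sanity check: a perfectly monotonic (identity) alignment scores 1.0.
print(attention_diagonality(np.eye(5)))
```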

Bibliographic Details
Main Authors: Yong-Seok Choi, Yo-Han Park, Seung Yun, Sang-Hun Kim, Kong-Joo Lee
Format: Article
Language: English
Published: MDPI AG, 2021-08-01
Series: Applied Sciences
Subjects: MASS; pre-trained generation model; unsupervised neural machine translation; language typology; writing script; SOV word order
ISSN: 2076-3417
DOI: 10.3390/app11167662
Online Access: https://www.mdpi.com/2076-3417/11/16/7662
Author Affiliations:
Yong-Seok Choi, Yo-Han Park, Kong-Joo Lee: Department of Radio and Information Communications Engineering, ChungNam National University, 99 Daejak-ro, Yuseong-gu, Daejeon 34134, Korea
Seung Yun, Sang-Hun Kim: Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute (ETRI), 218 Gajeong-ro, Yuseong-gu, Daejeon 34129, Korea