Factors Behind the Effectiveness of an Unsupervised Neural Machine Translation System between Korean and Japanese
Korean and Japanese have different writing scripts but share the same Subject-Object-Verb (SOV) word order. In this study, we pre-train a language-generation model with the Masked Sequence-to-Sequence pre-training (MASS) method on Korean and Japanese monolingual corpora. When building the pre-trained generation model, we allow only a minimal shared vocabulary between the two languages. We then build an unsupervised Neural Machine Translation (NMT) system between Korean and Japanese on top of the pre-trained generation model. Despite the different writing scripts and the small shared vocabulary, the unsupervised NMT system performs well compared with other language pairs. Our interest is in the common characteristics of the two languages that make the unsupervised NMT perform so well. We propose a new method that analyzes the cross-attentions between a source and a target language to estimate the differences between the languages from the perspective of machine translation. We compute cross-attention measurements for the Korean–Japanese and Korean–English pairs and compare their performances and characteristics. The Korean–Japanese pair differs little in word order and morphology, and thus the unsupervised NMT between Korean and Japanese can be trained well even without parallel sentences or a shared vocabulary.
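This record does not include the paper's implementation details, so the following is only a minimal, hypothetical sketch of the kind of cross-attention analysis the abstract describes: given a decoder's cross-attention matrix (here a toy NumPy array; in practice it would be extracted from a trained Transformer), each target position is aligned to its most-attended source position, and the fraction of order-preserving position pairs serves as a rough word-order similarity measure. The function names and toy matrices are illustrative assumptions, not the paper's actual measurements.

```python
import numpy as np

def alignment_from_cross_attention(attn: np.ndarray) -> np.ndarray:
    """Map each target position to its most-attended source position.

    attn: cross-attention weights of shape (target_len, source_len),
    e.g. averaged over the heads and layers of a Transformer decoder.
    """
    return attn.argmax(axis=1)

def monotonicity_score(alignment: np.ndarray) -> float:
    """Fraction of target-position pairs whose aligned source positions
    preserve their order (1.0 = fully monotonic, i.e. identical word order)."""
    n = len(alignment)
    if n < 2:
        return 1.0
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    concordant = sum(1 for i, j in pairs if alignment[i] <= alignment[j])
    return concordant / len(pairs)

# Toy illustration (hypothetical data, not from the paper):
rng = np.random.default_rng(0)
ko_ja_attn = np.eye(5) + 0.1 * rng.random((5, 5))  # near-diagonal: similar word order
ko_en_attn = ko_ja_attn[[0, 3, 4, 1, 2], :]        # permuted rows: simulated reordering

print(monotonicity_score(alignment_from_cross_attention(ko_ja_attn)))  # 1.0
print(monotonicity_score(alignment_from_cross_attention(ko_en_attn)))  # 0.6
```

Under this toy setup, a near-diagonal attention pattern, as one would expect for the Korean–Japanese pair with its shared SOV order, scores close to 1.0, while the permuted pattern standing in for Korean–English scores lower.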
Main Authors: | Yong-Seok Choi, Yo-Han Park, Seung Yun, Sang-Hun Kim, Kong-Joo Lee |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-08-01 |
Series: | Applied Sciences |
ISSN: | 2076-3417 |
DOI: | 10.3390/app11167662 |
Subjects: | MASS; pre-trained generation model; unsupervised neural machine translation; language typology; writing script; SOV word order |
Online Access: | https://www.mdpi.com/2076-3417/11/16/7662 |
Author Affiliations: |
Yong-Seok Choi, Yo-Han Park, Kong-Joo Lee: Department of Radio and Information Communications Engineering, ChungNam National University, 99 Daejak-ro, Yuseong-gu, Daejeon 34134, Korea
Seung Yun, Sang-Hun Kim: Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute (ETRI), 218 Gajeong-ro, Yuseong-gu, Daejeon 34129, Korea