Morphological segmentation method for Turkic language neural machine translation
Dictionaries play an important role in neural machine translation (NMT). However, a large dictionary requires a significant amount of memory, which limits the application of NMT and can cause a memory error. This limitation can be solved by segmenting each word into morphemes in parallel source corp...
Main Authors: | U. Tukeyev, A. Karibayeva, Zh. Zhumanov |
---|---|
Format: | Article |
Language: | English |
Published: | Taylor & Francis Group, 2020-01-01 |
Series: | Cogent Engineering |
Subjects: | neural machine translation, morphological segmentation, Turkic languages, Kazakh, Kyrgyz, Uzbek |
Online Access: | http://dx.doi.org/10.1080/23311916.2020.1856500 |
_version_ | 1797762885452038144 |
---|---|
author | U. Tukeyev, A. Karibayeva, Zh. Zhumanov |
author_sort | U. Tukeyev |
collection | DOAJ |
description | Dictionaries play an important role in neural machine translation (NMT). However, a large dictionary requires a significant amount of memory, which limits the application of NMT and can cause a memory error. This limitation can be addressed by segmenting each word into morphemes in parallel source corpora. Therefore, this study introduces a new morphological segmentation approach for Turkic languages based on the complete set of endings (CSE), which reduces the vocabulary volume of the source corpora. Herein, we demonstrate the proposed CSE-based morphological segmentation method for the Kazakh, Kyrgyz, and Uzbek languages and present the results of computational NMT experiments for the Kazakh language. The NMT experiment results show that, in comparison with byte-pair encoding (BPE)-based segmentation, the proposed CSE-based segmentation increases the bilingual evaluation understudy (BLEU) score by 0.5 and 0.2 points on average for the Kazakh–English and English–Kazakh pairs, respectively. Furthermore, in comparison with BPE-based segmentation, the proposed CSE-based segmentation approach reduced the vocabulary size in NMT by more than a factor of two. This feature of the proposed segmentation approach will be crucial for NMT as the size of the source corpora is increased to improve translation quality. |
first_indexed | 2024-03-12T19:33:42Z |
format | Article |
id | doaj.art-a7fecba605c9481998eec25cd64f0095 |
institution | Directory of Open Access Journals |
issn | 2331-1916 |
language | English |
last_indexed | 2024-03-12T19:33:42Z |
publishDate | 2020-01-01 |
publisher | Taylor & Francis Group |
record_format | Article |
series | Cogent Engineering |
spelling | doaj.art-a7fecba605c9481998eec25cd64f0095; 2023-08-02T04:19:50Z; eng; Taylor & Francis Group; Cogent Engineering; 2331-1916; 2020-01-01; 7; 1; 10.1080/23311916.2020.1856500; 1856500; Morphological segmentation method for Turkic language neural machine translation; U. Tukeyev, A. Karibayeva, Zh. Zhumanov (all Al-Farabi Kazakh National University); http://dx.doi.org/10.1080/23311916.2020.1856500 |
title | Morphological segmentation method for Turkic language neural machine translation |
topic | neural machine translation morphological segmentation turkic languages kazakh kyrgyz uzbek |
url | http://dx.doi.org/10.1080/23311916.2020.1856500 |
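
The description above says that splitting words into morphemes shrinks the vocabulary an NMT system has to store. The sketch below illustrates that effect only. It is not the authors' CSE method: the `ENDINGS` list is a tiny hypothetical subset of Kazakh endings chosen for the example, and the longest-suffix matching rule, the `min_stem` parameter, and the `+` marker on endings are all assumptions made for illustration.

```python
# Hypothetical, tiny subset of Kazakh endings, for illustration only.
# The paper's complete set of endings (CSE) is far larger and is not reproduced here.
ENDINGS = ["ларда", "лар", "да"]


def segment(word, endings=ENDINGS, min_stem=2):
    """Split a word into (stem, ending) by longest matching ending.

    Assumption: a plain longest-suffix match; the record does not say
    how the published method performs the match.
    """
    for ending in sorted(endings, key=len, reverse=True):
        # Require a stem of at least min_stem characters to avoid
        # treating a whole short word as an ending.
        if word.endswith(ending) and len(word) - len(ending) >= min_stem:
            return word[: -len(ending)], ending
    return word, ""


def vocabulary(words):
    """Collect the distinct tokens that remain after segmentation."""
    vocab = set()
    for word in words:
        stem, ending = segment(word)
        vocab.add(stem)
        if ending:
            # Prefix endings with "+" so they cannot collide with stems.
            vocab.add("+" + ending)
    return vocab


# Toy corpus: two stems ("бала", "қала"), each bare and with three endings.
corpus = [
    "бала", "балалар", "балаларда", "балада",
    "қала", "қалалар", "қалаларда", "қалада",
]

if __name__ == "__main__":
    print("word-level vocabulary size:", len(set(corpus)))
    print("segmented vocabulary size:", len(vocabulary(corpus)))
```

Because endings are shared across stems, each new stem adds one token to the segmented vocabulary instead of one token per inflected form, which is the kind of reduction the description refers to.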