Morphological segmentation method for Turkic language neural machine translation

Dictionaries play an important role in neural machine translation (NMT). However, a large dictionary requires a significant amount of memory, which limits the application of NMT and can cause a memory error. This limitation can be solved by segmenting each word into morphemes in parallel source corp...


Bibliographic Details
Main Authors:	U. Tukeyev, A. Karibayeva, Zh. Zhumanov
Format:	Article
Language:	English
Published:	Taylor & Francis Group 2020-01-01
Series:	Cogent Engineering
Subjects:	neural machine translation; morphological segmentation; Turkic languages; Kazakh; Kyrgyz; Uzbek
Online Access:	http://dx.doi.org/10.1080/23311916.2020.1856500

_version_	1797762885452038144
author	U. Tukeyev, A. Karibayeva, Zh. Zhumanov
author_facet	U. Tukeyev, A. Karibayeva, Zh. Zhumanov
author_sort	U. Tukeyev
collection	DOAJ
description	Dictionaries play an important role in neural machine translation (NMT). However, a large dictionary requires a significant amount of memory, which limits the application of NMT and can cause a memory error. This limitation can be solved by segmenting each word into morphemes in parallel source corpora. Therefore, this study introduces a new morphological segmentation approach for Turkic languages based on the complete set of endings (CSE), which reduces the vocabulary volume of the source corpora. Herein, we demonstrate the proposed CSE-based morphological segmentation method for the Kazakh, Kyrgyz, and Uzbek languages and present the results of computational NMT experiments for the Kazakh language. The NMT experiment results show that, in comparison with byte-pair encoding (BPE)-based segmentation, the proposed CSE-based segmentation increases the bilingual evaluation understudy (BLEU) score by 0.5 and 0.2 points on average for the Kazakh–English and English–Kazakh pairs, respectively. Furthermore, in comparison with the BPE-based segmentation, the proposed CSE-based segmentation approach reduced the vocabulary size in NMT by more than a factor of two. This feature of the proposed segmentation approach will be crucial for NMT as the size of the source corpora is increased to improve translation quality.
first_indexed	2024-03-12T19:33:42Z
format	Article
id	doaj.art-a7fecba605c9481998eec25cd64f0095
institution	Directory Open Access Journal
issn	2331-1916
language	English
last_indexed	2024-03-12T19:33:42Z
publishDate	2020-01-01
publisher	Taylor & Francis Group
record_format	Article
series	Cogent Engineering
spelling	doaj.art-a7fecba605c9481998eec25cd64f0095; 2023-08-02T04:19:50Z; eng; Taylor & Francis Group; Cogent Engineering; 2331-1916; 2020-01-01; 7; 1; 10.1080/23311916.2020.1856500; 1856500; Morphological segmentation method for Turkic language neural machine translation; U. Tukeyev (0), A. Karibayeva (1), Zh. Zhumanov (2); Al-Farabi Kazakh National University; Al-Farabi Kazakh National University; Al-Farabi Kazakh National University; http://dx.doi.org/10.1080/23311916.2020.1856500; neural machine translation; morphological segmentation; turkic languages; kazakh; kyrgyz; uzbek
title	Morphological segmentation method for Turkic language neural machine translation
topic	neural machine translation; morphological segmentation; Turkic languages; Kazakh; Kyrgyz; Uzbek
url	http://dx.doi.org/10.1080/23311916.2020.1856500


Similar Items