The neural machine translation models for the low-resource Kazakh–English language pair

The development of the machine translation field was driven by people’s need to communicate with each other globally by automatically translating words, sentences, and texts from one language into another. The neural machine translation approach has become one of the most significant in recent years. It requires large parallel corpora, which are not available for low-resource languages such as Kazakh, making it difficult to achieve high performance with neural machine translation models. This article explores existing methods for dealing with low-resource languages, namely forward translation, backward translation, and transfer learning, which artificially increase the size of the corpora and improve the performance of Kazakh–English machine translation models. The Sequence-to-Sequence (recurrent neural network and bidirectional recurrent neural network) and Transformer neural machine translation architectures, with their features and specifications, are then considered for experiments in training models on parallel corpora. The experimental part focuses on building translation models for high-quality translation of formal social, political, and scientific texts: synthetic parallel sentences are generated from existing monolingual Kazakh data using the forward translation approach and combined with parallel corpora parsed from official government websites. The resulting corpus of 380,000 parallel Kazakh–English sentences is used to train recurrent neural network (RNN), bidirectional recurrent neural network (BRNN), and Transformer models in the OpenNMT framework. The quality of the trained models is evaluated with the BLEU, WER, and TER metrics, and sample translations are also analyzed. The RNN and BRNN models produced more precise translations than the Transformer model, and the Byte-Pair Encoding tokenization technique showed better metric scores and translations than word tokenization. The bidirectional recurrent neural network with Byte-Pair Encoding showed the best performance, with 0.49 BLEU, 0.51 WER, and 0.45 TER.
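
The abstract reports corpus-level BLEU, WER, and TER scores for the trained models. As a rough illustration of how such scores can be computed, here is a minimal sketch (not the authors' code) using the sacrebleu and jiwer Python packages; the file names hypotheses.en and references.en are hypothetical stand-ins for a model's output and the reference translations.

```python
import sacrebleu
import jiwer

# Read the system output and the reference translations, one sentence per line.
with open("hypotheses.en", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("references.en", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])   # corpus-level BLEU
ter = sacrebleu.corpus_ter(hyps, [refs])     # translation edit rate
wer = jiwer.wer(refs, hyps)                  # word error rate

print(f"BLEU {bleu.score:.1f}  TER {ter.score:.1f}  WER {wer:.2f}")
```

Note that sacrebleu reports BLEU and TER on a 0-100 scale, whereas the scores quoted in the abstract are on a 0-1 scale; the comparison between models is unaffected by the choice of scale.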

Bibliographic Details
Main Authors: Vladislav Karyukin, Diana Rakhimova, Aidana Karibayeva, Aliya Turganbayeva, Asem Turarbek
Affiliation: Department of Information Systems, Al-Farabi Kazakh National University, Almaty, Kazakhstan
Format: Article
Language: English
Published: PeerJ Inc. 2023-02-01
Series: PeerJ Computer Science
ISSN: 2376-5992
DOI: 10.7717/peerj-cs.1224
Subjects: Neural machine translation; Forward translation; Backward translation; Seq2Seq; RNN; BRNN
Online Access: https://peerj.com/articles/cs-1224.pdf