Low-Resource Neural Machine Translation Improvement Using Source-Side Monolingual Data

Despite the many approaches proposed to address it, neural machine translation (NMT) for low-resource languages remains difficult. The issue becomes even more complicated when the few available resources cover only a single domain. In this paper, we discuss the applicability of a source-side monolingual dataset of low-resource languages to improve the NMT system for such languages. In our experiments, we used Wolaytta–English translation as a low-resource language pair. We discuss the use of self-learning and fine-tuning approaches to improve the NMT system for Wolaytta–English translation using both authentic and synthetic datasets. The self-learning approach showed +2.7 and +2.4 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively, over the best-performing baseline model. Further fine-tuning the best-performing self-learning model showed +1.2 and +0.6 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively. We reflect on our contributions and outline plans for future work in this difficult field of study.
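
The abstract outlines a self-learning plus fine-tuning pipeline for exploiting source-side monolingual data. The sketch below is a minimal illustration assuming the common self-training reading of that pipeline: train a baseline on the authentic parallel corpus, forward-translate the monolingual source sentences with it to build a synthetic corpus, retrain on the combined data, and finally fine-tune on the authentic data. The function names, toy training routine, and example sentences are hypothetical and are not taken from the paper.

```python
# Minimal sketch of self-learning (self-training) with source-side monolingual
# data. Everything here is an illustrative placeholder, not the authors' code.

from typing import Callable, List, Tuple

ParallelCorpus = List[Tuple[str, str]]   # (Wolaytta sentence, English sentence) pairs
Translator = Callable[[str], str]        # a trained model viewed as a translate function


def self_learning(
    authentic: ParallelCorpus,
    monolingual_source: List[str],
    train: Callable[[ParallelCorpus], Translator],
) -> Translator:
    """Improve an NMT model using only source-side monolingual data.

    1. Train a baseline model on the authentic parallel corpus.
    2. Forward-translate the monolingual source sentences with the baseline
       to build a synthetic parallel corpus.
    3. Retrain on the combined authentic + synthetic data.
    """
    baseline = train(authentic)
    synthetic: ParallelCorpus = [(src, baseline(src)) for src in monolingual_source]
    return train(authentic + synthetic)


def fine_tune(model: Translator, authentic: ParallelCorpus) -> Translator:
    """Placeholder for the later fine-tuning stage on authentic data only."""
    return model  # a real system would continue training the model here


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real system would train a
    # Transformer NMT model instead of this dictionary lookup.
    def toy_train(corpus: ParallelCorpus) -> Translator:
        table = dict(corpus)
        return lambda src: table.get(src, "<unk>")

    authentic = [("wolaytta sentence 1", "english sentence 1")]   # hypothetical pair
    mono = ["wolaytta sentence 1", "wolaytta sentence 2"]          # hypothetical monolingual data
    model = self_learning(authentic, mono, toy_train)
    model = fine_tune(model, authentic)
    print(model("wolaytta sentence 1"))
```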

Bibliographic Details
Main Authors: Atnafu Lambebo Tonja, Olga Kolesnikova, Alexander Gelbukh, Grigori Sidorov (Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico City 07738, Mexico)
Format: Article
Language: English
Published: MDPI AG, 2023-01-01
Series: Applied Sciences, Vol. 13, No. 2, Article 1201
ISSN: 2076-3417
DOI: 10.3390/app13021201
Subjects: Wolaytta–English NMT; English–Wolaytta NMT; low-resource NMT; self-learning; neural machine translation; monolingual data for low-resource languages
Online Access: https://www.mdpi.com/2076-3417/13/2/1201
Collection: DOAJ (Directory of Open Access Journals)