Evaluating automatic sentence alignment approaches on English-Slovak sentences

Abstract Parallel texts represent a very valuable resource in many applications of natural language processing. The fundamental step in creating a parallel corpus is alignment. Sentence alignment is the task of finding correspondences between source sentences and their equivalent translations in t...


Bibliographic Details
Main Authors: Frantisek Forgac, Dasa Munkova, Michal Munk, Livia Kelebercova
Format: Article
Language: English
Published: Nature Portfolio 2023-11-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-023-47479-w
_version_ 1827711428712202240
author Frantisek Forgac
Dasa Munkova
Michal Munk
Livia Kelebercova
author_facet Frantisek Forgac
Dasa Munkova
Michal Munk
Livia Kelebercova
author_sort Frantisek Forgac
collection DOAJ
description Abstract Parallel texts represent a very valuable resource in many applications of natural language processing. The fundamental step in creating a parallel corpus is alignment. Sentence alignment is the task of finding correspondences between source sentences and their equivalent translations in the target text. A number of automatic sentence alignment approaches have been proposed, including neural networks; they can be divided into length-based, lexicon-based, and translation-based approaches. In our study, we used five different aligners, namely Bilingual sentence aligner (BSA), Hunalign, Bleualign, Vecalign, and Bertalign. We evaluated the accuracy of Bertalign against the aligners employed to date, and compared all aligners with each other, for the English-Slovak language pair. We created a custom corpus consisting of texts collected in 2021 and 2022. Vecalign and Bertalign achieved the statistically significantly best performance, and BSA the worst. Hunalign and Bleualign achieved the same performance in terms of F1 score. However, Bleualign showed the most variable results in terms of performance.
first_indexed 2024-03-10T17:55:31Z
format Article
id doaj.art-0693716c6dec4b8f95412f786d596a0a
institution Directory Open Access Journal
issn 2045-2322
language English
last_indexed 2024-03-10T17:55:31Z
publishDate 2023-11-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj.art-0693716c6dec4b8f95412f786d596a0a
2023-11-20T09:12:36Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2023-11-01
131112
10.1038/s41598-023-47479-w
Evaluating automatic sentence alignment approaches on English-Slovak sentences
Frantisek Forgac 0
Dasa Munkova 1
Michal Munk 2
Livia Kelebercova 3
Faculty of Natural Sciences and Informatics, Constantine the Philosopher University in Nitra
Faculty of Natural Sciences and Informatics, Constantine the Philosopher University in Nitra
Faculty of Natural Sciences and Informatics, Constantine the Philosopher University in Nitra
Faculty of Natural Sciences and Informatics, Constantine the Philosopher University in Nitra
Abstract Parallel texts represent a very valuable resource in many applications of natural language processing. The fundamental step in creating a parallel corpus is alignment. Sentence alignment is the task of finding correspondences between source sentences and their equivalent translations in the target text. A number of automatic sentence alignment approaches have been proposed, including neural networks; they can be divided into length-based, lexicon-based, and translation-based approaches. In our study, we used five different aligners, namely Bilingual sentence aligner (BSA), Hunalign, Bleualign, Vecalign, and Bertalign. We evaluated the accuracy of Bertalign against the aligners employed to date, and compared all aligners with each other, for the English-Slovak language pair. We created a custom corpus consisting of texts collected in 2021 and 2022. Vecalign and Bertalign achieved the statistically significantly best performance, and BSA the worst. Hunalign and Bleualign achieved the same performance in terms of F1 score. However, Bleualign showed the most variable results in terms of performance.
https://doi.org/10.1038/s41598-023-47479-w
spellingShingle Frantisek Forgac
Dasa Munkova
Michal Munk
Livia Kelebercova
Evaluating automatic sentence alignment approaches on English-Slovak sentences
Scientific Reports
title Evaluating automatic sentence alignment approaches on English-Slovak sentences
title_full Evaluating automatic sentence alignment approaches on English-Slovak sentences
title_fullStr Evaluating automatic sentence alignment approaches on English-Slovak sentences
title_full_unstemmed Evaluating automatic sentence alignment approaches on English-Slovak sentences
title_short Evaluating automatic sentence alignment approaches on English-Slovak sentences
title_sort evaluating automatic sentence alignment approaches on english slovak sentences
url https://doi.org/10.1038/s41598-023-47479-w
work_keys_str_mv AT frantisekforgac evaluatingautomaticsentencealignmentapproachesonenglishslovaksentences
AT dasamunkova evaluatingautomaticsentencealignmentapproachesonenglishslovaksentences
AT michalmunk evaluatingautomaticsentencealignmentapproachesonenglishslovaksentences
AT liviakelebercova evaluatingautomaticsentencealignmentapproachesonenglishslovaksentences