Correcting Diacritics and Typos with a ByT5 Transformer Model

Due to the fast pace of life and online communications and the prevalence of English and the QWERTY keyboard, people tend to forgo diacritics and make typographical errors (typos) when typing in other languages. Restoring diacritics and correcting spelling is important for proper language use and...

Bibliographic Details
Main Authors: Lukas Stankevičius, Mantas Lukoševičius, Jurgita Kapočiūtė-Dzikienė, Monika Briedienė, Tomas Krilavičius
Format: Article
Language: English
Published: MDPI AG 2022-03-01
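Series: Applied Sciences
Subjects: natural language processing; diacritics restoration; typo correction; transformer models; ByT5; QWERTY
Online Access: https://www.mdpi.com/2076-3417/12/5/2636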
_version_ 1797475609832587264
author Lukas Stankevičius
Mantas Lukoševičius
Jurgita Kapočiūtė-Dzikienė
Monika Briedienė
Tomas Krilavičius
author_sort Lukas Stankevičius
collection DOAJ
description Due to the fast pace of life and online communications and the prevalence of English and the QWERTY keyboard, people tend to forgo diacritics and make typographical errors (typos) when typing in other languages. Restoring diacritics and correcting spelling is important for proper language use and the disambiguation of texts for both humans and downstream algorithms. However, these two problems are typically addressed separately: the state-of-the-art diacritics restoration methods do not tolerate other typos, and classical spellcheckers cannot deal adequately with text in which all the diacritics are missing. In this work, we tackle both problems at once by employing the newly developed universal ByT5 byte-level seq2seq transformer model, which requires no language-specific model structures. For comparison, we perform diacritics restoration on benchmark datasets of 12 languages, with the addition of Lithuanian. The experimental investigation shows that our approach achieves results (>98%) comparable to the previous state of the art, despite less training and less training data. Our approach is also able to restore diacritics in words not seen during training with >76% accuracy. Our simultaneous diacritics restoration and typo correction approach reaches >94% alpha-word accuracy on the 13 languages. It has no direct competitors and strongly outperforms classical spell-checking or dictionary-based approaches. We also show that all the accuracies improve further with more training. Taken together, this shows the great potential of the suggested methods for real-world application and for extension to more data, languages, and error classes.
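As a rough illustration of the task described above, the following Python sketch builds a (source, target) pair by stripping diacritics and shows the UTF-8 bytes that a byte-level model would read. It is a minimal sketch under stated assumptions, not the authors' preprocessing: the function names strip_diacritics and to_bytes are hypothetical, and this record does not say how the paper's training data was actually prepared.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks after NFD decomposition.

    Assumption for illustration only: letters that have no canonical
    decomposition (for example Polish "ł") pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_bytes(text: str) -> list[int]:
    """Return the UTF-8 byte values of the text, the raw units of a byte-level model."""
    return list(text.encode("utf-8"))


target = "Kapočiūtė"                 # correctly written text (the desired output)
source = strip_diacritics(target)    # degraded text (the model input)

print(source)                                          # Kapociute
print(len(to_bytes(source)), len(to_bytes(target)))    # 9 12
```

Each of the three accented letters occupies two bytes in UTF-8, so the target is three bytes longer than the source. Because the model works on bytes rather than on a language-specific vocabulary, the same input and output representation can serve every language in the benchmark.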
first_indexed 2024-03-09T20:46:36Z
format Article
id doaj.art-549cbc52a74349358529ba95f22c7ec4
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-09T20:46:36Z
publishDate 2022-03-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-549cbc52a74349358529ba95f22c7ec4
2023-11-23T22:43:53Z
eng
MDPI AG
Applied Sciences
2076-3417
2022-03-01
12
5
2636
10.3390/app12052636
Correcting Diacritics and Typos with a ByT5 Transformer Model
Lukas Stankevičius (Faculty of Informatics, Kaunas University of Technology, LT-51368 Kaunas, Lithuania)
Mantas Lukoševičius (Faculty of Informatics, Kaunas University of Technology, LT-51368 Kaunas, Lithuania)
Jurgita Kapočiūtė-Dzikienė (Faculty of Informatics, Vytautas Magnus University, LT-44404 Kaunas, Lithuania)
Monika Briedienė (Faculty of Informatics, Vytautas Magnus University, LT-44404 Kaunas, Lithuania)
Tomas Krilavičius (Faculty of Informatics, Vytautas Magnus University, LT-44404 Kaunas, Lithuania)
https://www.mdpi.com/2076-3417/12/5/2636
natural language processing
diacritics restoration
typo correction
transformer models
ByT5
QWERTY
title Correcting Diacritics and Typos with a ByT5 Transformer Model
topic natural language processing
diacritics restoration
typo correction
transformer models
ByT5
QWERTY
url https://www.mdpi.com/2076-3417/12/5/2636