Kurdish Kurmanji Lemmatization and Spell-checker with Spell-correction
Many studies address lemmatization and spell-checking with spell-correction for English, Arabic, and Persian, but only a few address low-resource languages such as Kurdish, and more specifically the Kurmanji dialect, which increased the need of creat...
Main Authors: | Hanar Hoshyar Mustafa, Rebwar M. Nabi |
---|---|
Format: | Article |
Language: | English |
Published: | University of Human Development, 2023-02-01 |
Series: | UHD Journal of Science and Technology |
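Subjects: | kurdish language; kurmanji dialect; kurdish lemmatizer; kurdish spell-checker and spell-correction; kurdish dataset |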
Online Access: | https://journals.uhd.edu.iq/index.php/uhdjst/article/view/1076 |
_version_ | 1827947306589093888 |
---|---|
author | Hanar Hoshyar Mustafa; Rebwar M. Nabi |
author_facet | Hanar Hoshyar Mustafa; Rebwar M. Nabi |
author_sort | Hanar Hoshyar Mustafa |
collection | DOAJ |
description | Many studies address lemmatization and spell-checking with spell-correction for English, Arabic, and Persian, but only a few address low-resource languages such as Kurdish, and more specifically the Kurmanji dialect, which has increased the need for such systems. Lemmatization is the process of determining the base or dictionary form (lemma) of a given surface form, whereas spell-checkers determine whether a word is correctly spelled and spell-correctors correct a range of spelling errors. This research presents a lemmatization system and a word-level error correction system for the Kurdish Kurmanji dialect, which are, to the best of our knowledge, the first such tools for this dialect. The proposed lemmatization approach is built on morphological rules, while the spell-checker and spell-correction use a hybrid approach that relies on an n-gram language model and the Jaccard coefficient similarity algorithm. As detailed in this article, lemmatization reaches accuracy rates of 97.7% for nouns and 99.3% for verbs. The spell-checker and spell-correction attain accuracy rates of 100% and 90.77%, respectively. |
first_indexed | 2024-04-09T12:37:59Z |
format | Article |
id | doaj.art-c87b4b686eb74ab0a3a3b669ff94120d |
institution | Directory Open Access Journal |
issn | 2521-4209 2521-4217 |
language | English |
last_indexed | 2024-04-09T12:37:59Z |
publishDate | 2023-02-01 |
publisher | University of Human Development |
record_format | Article |
series | UHD Journal of Science and Technology |
spelling | doaj.art-c87b4b686eb74ab0a3a3b669ff94120d; 2023-05-15T08:33:25Z; eng; University of Human Development; UHD Journal of Science and Technology; ISSN 2521-4209, 2521-4217; 2023-02-01; vol. 7, no. 1, pp. 43-52; DOI 10.21928/uhdjst.v7n1y2023.pp43-52; 1207; Kurdish Kurmanji Lemmatization and Spell-checker with Spell-correction; Hanar Hoshyar Mustafa (Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Kurdistan Region, Iraq); Rebwar M. Nabi (Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Kurdistan Region, Iraq); https://journals.uhd.edu.iq/index.php/uhdjst/article/view/1076; kurdish language; kurmanji dialect; kurdish lemmatizer; kurdish spell-checker and spell-correction; kurdish dataset |
title | Kurdish Kurmanji Lemmatization and Spell-checker with Spell-correction |
topic | kurdish language; kurmanji dialect; kurdish lemmatizer; kurdish spell-checker and spell-correction; kurdish dataset |
url | https://journals.uhd.edu.iq/index.php/uhdjst/article/view/1076 |
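
The description field in this record names the Jaccard coefficient as the similarity measure behind the spell-correction step. The following is only a minimal sketch of how such a measure can rank correction candidates; it is not the authors' implementation. It assumes character bigrams as the compared units (the record does not say how the n-gram language model is used), and the function names and the tiny `LEXICON` are hypothetical, chosen purely for illustration.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word, padded with '#'."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_known(word, lexicon):
    """Spell-check step: a word is accepted if it appears in the lexicon."""
    return word.lower() in lexicon


def suggest(word, lexicon, n=2, top_k=3):
    """Spell-correction step: rank lexicon entries by Jaccard similarity."""
    target = char_ngrams(word, n)
    scored = [(jaccard(target, char_ngrams(cand, n)), cand) for cand in lexicon]
    # Highest similarity first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cand for score, cand in scored[:top_k] if score > 0]


# Hypothetical mini-lexicon of Kurmanji words (illustrative only).
LEXICON = {"heval", "ziman", "pirtûk", "xwendin", "spas", "roj"}

if __name__ == "__main__":
    for token in ["spas", "hevla", "zimen"]:
        if is_known(token, LEXICON):
            print(token, "is in the lexicon")
        else:
            print(token, "->", suggest(token, LEXICON))
```

A real system would draw candidates from a much larger dictionary and, as the description indicates, combine the similarity score with an n-gram language model rather than rely on set overlap alone.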