A Character String-Based Stemming for Morphologically Derivative Languages

Morphologically derivative languages form words by fusing stems and suffixes, so stems must be extracted to support cross-lingual alignment and knowledge transfer. Because phonetic harmony and disharmony arise when linguistic particles are combined, both phonetic and morphological chan...

Bibliographic Details
Main Authors: Gvzelnur Imin, Mijit Ablimit, Hankiz Yilahun, Askar Hamdulla
Format: Article
Language: English
Published: MDPI AG 2022-03-01
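Series: Information
Subjects: agglutinative language, multilingual language, stemming, attention mechanism, BiLSTM-Attention-CRF, context
Online Access: https://www.mdpi.com/2078-2489/13/4/170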
_version_ 1797410689471479808
author Gvzelnur Imin
Mijit Ablimit
Hankiz Yilahun
Askar Hamdulla
author_facet Gvzelnur Imin
Mijit Ablimit
Hankiz Yilahun
Askar Hamdulla
author_sort Gvzelnur Imin
collection DOAJ
description Morphologically derivative languages form words by fusing stems and suffixes, so stems must be extracted to support cross-lingual alignment and knowledge transfer. Because phonetic harmony and disharmony arise when linguistic particles are combined, both phonetic and morphological changes need to be analyzed. This paper proposes a multilingual stemming method that automatically learns morpho-phonetic changes using character-based embedding and sequential modeling. First, sentence-level character feature embeddings are used as input, and a BiLSTM model captures the forward and backward context sequences. An attention mechanism is added to the model for weight learning, extracting global feature information to capture stem and affix boundaries. Finally, a CRF model learns further information from the sequence features to describe context more effectively. To verify its effectiveness, the proposed model is compared with traditional models on two different datasets of three derivative languages: Uyghur, Kazakh and Kirghiz. The experimental results show that the proposed model achieves the best stemming performance on multilingual sentence-level datasets. In addition, it outperforms the traditional models, takes full account of the data characteristics, and has the advantage of requiring less human intervention.
first_indexed 2024-03-09T04:33:57Z
format Article
id doaj.art-6b70b0b7063a436da6c757d1a327ad3d
institution Directory Open Access Journal
issn 2078-2489
language English
last_indexed 2024-03-09T04:33:57Z
publishDate 2022-03-01
publisher MDPI AG
record_format Article
series Information
spelling doaj.art-6b70b0b7063a436da6c757d1a327ad3d
2023-12-03T13:31:15Z
eng
MDPI AG
Information
2078-2489
2022-03-01
13, 4, 170
10.3390/info13040170
A Character String-Based Stemming for Morphologically Derivative Languages
Gvzelnur Imin, Mijit Ablimit, Hankiz Yilahun, Askar Hamdulla (all: College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China)
https://www.mdpi.com/2078-2489/13/4/170
agglutinative language
multilingual language
stemming
attention mechanism
BiLSTM-Attention-CRF
context
spellingShingle Gvzelnur Imin
Mijit Ablimit
Hankiz Yilahun
Askar Hamdulla
A Character String-Based Stemming for Morphologically Derivative Languages
Information
agglutinative language
multilingual language
stemming
attention mechanism
BiLSTM-Attention-CRF
context
title A Character String-Based Stemming for Morphologically Derivative Languages
title_full A Character String-Based Stemming for Morphologically Derivative Languages
title_fullStr A Character String-Based Stemming for Morphologically Derivative Languages
title_full_unstemmed A Character String-Based Stemming for Morphologically Derivative Languages
title_short A Character String-Based Stemming for Morphologically Derivative Languages
title_sort character string based stemming for morphologically derivative languages
topic agglutinative language
multilingual language
stemming
attention mechanism
BiLSTM-Attention-CRF
context
url https://www.mdpi.com/2078-2489/13/4/170
work_keys_str_mv AT gvzelnurimin acharacterstringbasedstemmingformorphologicallyderivativelanguages
AT mijitablimit acharacterstringbasedstemmingformorphologicallyderivativelanguages
AT hankizyilahun acharacterstringbasedstemmingformorphologicallyderivativelanguages
AT askarhamdulla acharacterstringbasedstemmingformorphologicallyderivativelanguages
AT gvzelnurimin characterstringbasedstemmingformorphologicallyderivativelanguages
AT mijitablimit characterstringbasedstemmingformorphologicallyderivativelanguages
AT hankizyilahun characterstringbasedstemmingformorphologicallyderivativelanguages
AT askarhamdulla characterstringbasedstemmingformorphologicallyderivativelanguages
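
The description field above frames stemming as character-level sequence labeling, with a CRF layer choosing the final tag sequence that marks stem and affix boundaries. The following is a minimal sketch of that decoding step only, not the authors' implementation. It assumes a two-tag scheme (`S` for a stem character, `A` for an affix character), and the emission scores are hand-written stand-ins for what a BiLSTM-attention encoder might output. The tag set, the scores, the transition weights, the function names `viterbi` and `split_stem`, and the Latin-script example word `kitablar` (read here as `kitab` + `lar`) are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

TAGS = ("S", "A")  # assumed tag set: S = stem character, A = affix character


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring tag sequence for per-character scores."""
    if not emissions:
        return []
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in TAGS}]
    back: List[Dict[str, str]] = []  # back[t-1][tag] = best previous tag
    for scores in emissions[1:]:
        row: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + transitions[(p, cur)])
            row[cur] = best[-1][prev] + transitions[(prev, cur)] + scores[cur]
            ptr[cur] = prev
        best.append(row)
        back.append(ptr)
    # Trace back from the best final tag to recover the full path.
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]


def split_stem(word: str, tags: List[str]) -> Tuple[str, str]:
    """Split a word into (stem, affixes) at the first affix tag."""
    cut = tags.index("A") if "A" in tags else len(word)
    return word[:cut], word[cut:]


# Hypothetical per-character scores for "kitablar"; position 6 is noisy
# and, taken on its own, would be tagged as a stem character.
WORD = "kitablar"
EMISSIONS = [
    {"S": 2.0, "A": 0.0}, {"S": 2.0, "A": 0.0},
    {"S": 2.0, "A": 0.0}, {"S": 2.0, "A": 0.0},
    {"S": 1.5, "A": 0.5}, {"S": 0.2, "A": 1.8},
    {"S": 1.0, "A": 0.8}, {"S": 0.1, "A": 1.9},
]
# Assumed transition weights: returning from affix to stem is penalized.
TRANSITIONS = {("S", "S"): 0.5, ("S", "A"): 0.0,
               ("A", "A"): 0.5, ("A", "S"): -5.0}

if __name__ == "__main__":
    tags = viterbi(EMISSIONS, TRANSITIONS)
    print(tags)
    print(split_stem(WORD, tags))
```

Tagging each character independently by its highest emission score would label position 6 as `S` and break the suffix in two. The transition penalty on `A` to `S` is what lets the sequence-level decoder keep the affix contiguous, which is the kind of context the description attributes to the CRF layer.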