Uyghur–Kazakh–Kirghiz Text Keyword Extraction Based on Morpheme Segmentation

In this study, based on a morpheme segmentation framework, we researched a text keyword extraction method for Uyghur, Kazakh and Kirghiz languages, which have similar grammatical and lexical structures. In these languages, affixes and a stem are joined together to form a word. A stem is a word parti...

Bibliographic Details
Main Authors:	Sardar Parhat, Mutallip Sattar, Askar Hamdulla, Abdurahman Kadir
Format:	Article
Language:	English
Published:	MDPI AG 2023-05-01
Series:	Information
Subjects:	Uyghur–Kazakh–Kirghiz, keyword extraction, morpheme segmentation, stem extraction, stem vector, TextRank
Online Access:	https://www.mdpi.com/2078-2489/14/5/283

_version_	1797599719185186816
author	Sardar Parhat, Mutallip Sattar, Askar Hamdulla, Abdurahman Kadir
author_sort	Sardar Parhat
collection	DOAJ
description	In this study, we investigate a text keyword extraction method, built on a morpheme segmentation framework, for Uyghur, Kazakh and Kirghiz, languages with similar grammatical and lexical structures. In these languages, a word is formed by attaching affixes to a stem. The stem carries the notional meaning, while the affixes perform grammatical functions. Because of this derivational morphology, the vocabularies of these languages are very large. Therefore, pre-processing is a necessary step in NLP tasks for Uyghur, Kazakh and Kirghiz. Morpheme segmentation enabled us to remove suffixes, which act as auxiliary units, while retaining the meaningful stems, and it reduced the dimensionality of the feature space in the keyword extraction task for Uyghur, Kazakh and Kirghiz texts. We cast morpheme segmentation as a sequence labeling problem and used a Bi-LSTM network to capture the positional features of character sequences in both directions. We applied a CRF layer to learn the dependencies between preceding and following labels, building a highly accurate Bi-LSTM_CRF morpheme segmentation model, which we then used to prepare morpheme-based experimental text sets. We then trained stem embedding vectors with the Doc2vec algorithm, used the similarity between stem vectors to modify the TextRank algorithm, and performed a text keyword extraction experiment. The highest F1 scores obtained on the three datasets were 43.8%, 44% and 43.9%. The experimental results show that the morpheme-based approach performs much better than the word-based approach, indicating that stem vector similarity weighting is effective for text keyword extraction and that morpheme sequences are well suited to morphologically derivational languages.
first_indexed	2024-03-11T03:38:15Z
format	Article
id	doaj.art-1fd6d349676741bcbd4e395ba006ce07
institution	Directory of Open Access Journals
issn	2078-2489
language	English
last_indexed	2024-03-11T03:38:15Z
publishDate	2023-05-01
publisher	MDPI AG
record_format	Article
series	Information
spelling	doaj.art-1fd6d349676741bcbd4e395ba006ce07; 2023-11-18T01:48:03Z; eng; MDPI AG; Information; 2078-2489; 2023-05-01; volume 14, issue 5, article 283; DOI 10.3390/info14050283; Sardar Parhat, Mutallip Sattar and Abdurahman Kadir (College of Information Management, Xinjiang University of Finance and Economics, Urumqi 830012, China); Askar Hamdulla (College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China)
title	Uyghur–Kazakh–Kirghiz Text Keyword Extraction Based on Morpheme Segmentation
topic	Uyghur–Kazakh–Kirghiz, keyword extraction, morpheme segmentation, stem extraction, stem vector, TextRank
url	https://www.mdpi.com/2078-2489/14/5/283
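
The keyword-ranking step named in the description (TextRank modified by stem vector similarity) can be illustrated with a minimal Python sketch. This record does not give the paper's exact weighting formula, so the edge weight used here (co-occurrence within a sliding window, scaled by the cosine similarity of the two stem vectors) is an assumption. The function name `similarity_textrank`, the window size, the damping factor, and the placeholder stems `s1` to `s3` are also hypothetical, and the hand-written 2-dimensional vectors only stand in for stem embeddings that the authors trained with Doc2vec.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_textrank(stems, vectors, window=2, damping=0.85, iterations=50):
    """Rank stems with a TextRank-style iteration on a similarity-weighted graph.

    Assumption: each co-occurrence of two distinct stems within `window`
    positions adds their (non-negative) cosine similarity to the edge weight.
    """
    weights = {}
    for i, s in enumerate(stems):
        for t in stems[i + 1 : i + 1 + window]:
            if s == t:
                continue
            sim = max(cosine(vectors[s], vectors[t]), 0.0)
            # The graph is undirected, so store the weight in both directions.
            weights[(s, t)] = weights.get((s, t), 0.0) + sim
            weights[(t, s)] = weights.get((t, s), 0.0) + sim

    nodes = sorted(set(stems))
    # Total outgoing weight per node, used to normalise each contribution.
    out_sum = {n: sum(w for (a, _), w in weights.items() if a == n) for n in nodes}

    scores = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            rank = sum(
                weights.get((m, n), 0.0) / out_sum[m] * scores[m]
                for m in nodes
                if out_sum[m] > 0
            )
            updated[n] = (1 - damping) + damping * rank
        scores = updated
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder input: a stem sequence as produced by morpheme segmentation,
# plus toy vectors standing in for trained stem embeddings.
stems = ["s1", "s2", "s3", "s1", "s2"]
vectors = {"s1": [1.0, 0.0], "s2": [0.8, 0.6], "s3": [0.0, 1.0]}
ranked = similarity_textrank(stems, vectors)
for stem, score in ranked:
    print(f"{stem}\t{score:.3f}")
```

Scaling edges by vector similarity means that stems which co-occur and are also semantically close reinforce each other more strongly than stems that merely appear near one another, which is the general idea the description attributes to the stem-vector weighting.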

Similar Items