Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling

Document clustering groups documents according to certain semantic features. Topic models offer a richer semantic structure and considerable potential for helping users understand document corpora. Unfortunately, this potential is stymied on text documents that overlap in nature, due to their...

Bibliographic Details
Main Authors:	Mubashar Mustafa, Feng Zeng, Hussain Ghulam, Hafiz Muhammad Arslan
Format:	Article
Language:	English
Published:	MDPI AG 2020-11-01
Series:	Information
Subjects:	topic modeling; document clustering; non-negative matrix factorization (NMF); Urdu language; latent Dirichlet allocation (LDA)
Online Access:	https://www.mdpi.com/2078-2489/11/11/518

author	Mubashar Mustafa, Feng Zeng, Hussain Ghulam, Hafiz Muhammad Arslan
collection	DOAJ
description	Document clustering groups documents according to certain semantic features. Topic models offer a richer semantic structure and considerable potential for helping users understand document corpora. Unfortunately, because these models are purely unsupervised, this potential is stymied on text documents whose categories overlap. To solve this problem, some semi-supervised models have been proposed for English. However, no such work is available for Urdu, a low-resource language. Document clustering therefore remains a challenging task in Urdu, which has its own morphology, syntax and semantics. In this study, we propose a semi-supervised framework for Urdu document clustering that addresses the challenges of Urdu morphology. The proposed model combines pre-processing techniques, the seeded-LDA model and Gibbs sampling; we name it seeded-Urdu Latent Dirichlet Allocation (seeded-ULDA). We apply the proposed model and other methods to Urdu news datasets for categorization. Two conditions are considered for document clustering: a “dataset without overlapping”, in which all classes are distinct, and a “dataset with overlapping”, in which the categories overlap and the classes are connected to each other. The aim of this study is threefold. First, it shows that unsupervised models (Latent Dirichlet Allocation (LDA), non-negative matrix factorization (NMF) and K-means) give satisfactory results on the dataset without overlapping. Second, it shows that these unsupervised models do not perform well on the dataset with overlapping because, on this dataset, they find some topics that are neither entirely meaningful nor effective in extrinsic tasks. Third, the proposed semi-supervised model, seeded-ULDA, performs well on both datasets because it offers a straightforward and effective way to instruct topic models to find topics of specific interest. The paper shows that seeded-ULDA provides significant improvements over the unsupervised algorithms.
first_indexed	2024-03-10T15:05:20Z
format	Article
id	doaj.art-a9a930230e9748a384a9e46f89f899d5
institution	Directory of Open Access Journals
issn	2078-2489
language	English
last_indexed	2024-03-10T15:05:20Z
publishDate	2020-11-01
publisher	MDPI AG
record_format	Article
series	Information
spelling	doaj.art-a9a930230e9748a384a9e46f89f899d5; 2023-11-20T19:52:00Z; eng; MDPI AG; Information; 2078-2489; 2020-11-01; volume 11, issue 11, article 518; 10.3390/info11110518; Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling; Mubashar Mustafa (0), Feng Zeng (1), Hussain Ghulam (2), Hafiz Muhammad Arslan (3); (0) School of Computer Science and Engineering, Central South University, 410083 Changsha, China; (1) School of Computer Science and Engineering, Central South University, 410083 Changsha, China; (2) School of Computer Science and Engineering, Central South University, 410083 Changsha, China; (3) School of Software Engineering, Northeastern University, 110819 Shenyang, China; https://www.mdpi.com/2078-2489/11/11/518; topic modeling; document clustering; non-negative matrix factorization (NMF); Urdu language; latent Dirichlet allocation (LDA)
title	Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling
topic	topic modeling; document clustering; non-negative matrix factorization (NMF); Urdu language; latent Dirichlet allocation (LDA)
url	https://www.mdpi.com/2078-2489/11/11/518

Similar Items