Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling

Document clustering groups documents according to certain semantic features. Topic models have a rich semantic structure and considerable potential for helping users understand document corpora. Unfortunately, this potential is stymied on text documents which have overlapping nature, due to their...


Bibliographic Details
Main Authors: Mubashar Mustafa, Feng Zeng, Hussain Ghulam, Hafiz Muhammad Arslan
Format: Article
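Language: English
Published: MDPI AG 2020-11-01
Series: Information
Subjects: topic modeling; document clustering; non-negative matrix factorization (NMF); Urdu language; latent Dirichlet allocation (LDA)
Online Access: https://www.mdpi.com/2078-2489/11/11/518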
_version_ 1827702850465038336
author Mubashar Mustafa
Feng Zeng
Hussain Ghulam
Hafiz Muhammad Arslan
author_facet Mubashar Mustafa
Feng Zeng
Hussain Ghulam
Hafiz Muhammad Arslan
author_sort Mubashar Mustafa
collection DOAJ
description Document clustering groups documents according to semantic features. Topic models offer a rich semantic structure and considerable potential for helping users understand document corpora. Because they are purely unsupervised, however, this potential is limited on text documents whose categories overlap. Semi-supervised models have been proposed to address this problem for English, but no such work is available for Urdu, a low-resource language with its own morphology, syntax and semantics. Document clustering in Urdu therefore remains a challenging task. In this study, we propose a semi-supervised framework for Urdu document clustering that addresses the challenges of Urdu morphology. The proposed model, which we name seeded-Urdu Latent Dirichlet Allocation (seeded-ULDA), combines pre-processing techniques, the seeded-LDA model and Gibbs sampling. We apply the proposed model and other methods to Urdu news datasets for categorization. Two conditions are considered for document clustering: a “dataset without overlapping”, in which all classes are distinct, and a “dataset with overlapping”, in which the categories overlap and the classes are connected to each other. The aim of this study is threefold. First, it shows that unsupervised models (Latent Dirichlet Allocation (LDA), non-negative matrix factorization (NMF) and K-means) give satisfactory results on the dataset without overlapping. Second, it shows that these unsupervised models do not perform well on the dataset with overlapping, because on this dataset they find some topics that are neither entirely meaningful nor effective in extrinsic tasks. Third, the proposed semi-supervised model, Seeded-ULDA, performs well on both datasets, because it provides a straightforward and effective way to instruct topic models to find topics of specific interest. The paper shows that Seeded-ULDA provides significant improvements over the unsupervised algorithms.
first_indexed 2024-03-10T15:05:20Z
format Article
id doaj.art-a9a930230e9748a384a9e46f89f899d5
institution Directory Open Access Journal
issn 2078-2489
language English
last_indexed 2024-03-10T15:05:20Z
publishDate 2020-11-01
publisher MDPI AG
record_format Article
series Information
spelling doaj.art-a9a930230e9748a384a9e46f89f899d5 2023-11-20T19:52:00Z eng MDPI AG Information 2078-2489 2020-11-01 11 11 518 10.3390/info11110518
Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling
Mubashar Mustafa (0), Feng Zeng (1), Hussain Ghulam (2), Hafiz Muhammad Arslan (3)
(0) School of Computer Science and Engineering, Central South University, 410083 Changsha, China
(1) School of Computer Science and Engineering, Central South University, 410083 Changsha, China
(2) School of Computer Science and Engineering, Central South University, 410083 Changsha, China
(3) School of Software Engineering, Northeastern University, 110819 Shenyang, China
https://www.mdpi.com/2078-2489/11/11/518
topic modeling; document clustering; non-negative matrix factorization (NMF); Urdu language; latent dirichlet allocation (LDA)
title Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling
topic topic modeling
document clustering
non-negative matrix factorization (NMF)
Urdu language
latent dirichlet allocation (LDA)
url https://www.mdpi.com/2078-2489/11/11/518
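
The description in this record characterizes seeded-ULDA as a combination of pre-processing, a seeded-LDA model and Gibbs sampling. The sketch below illustrates only the general idea of seeding a topic model with collapsed Gibbs sampling; it is not the authors' implementation. The function name `seeded_lda`, the hyperparameters (`alpha`, `beta`, `boost`), the way seed words are encoded as a boosted topic-word prior, and the romanized toy tokens are all assumptions made for illustration, and the Urdu pre-processing step is omitted.

```python
import random
from collections import Counter


def seeded_lda(docs, seeds, n_iter=200, alpha=0.1, beta=0.01, boost=5.0, rng_seed=0):
    """Collapsed Gibbs sampling for LDA with a seed-word prior (illustrative sketch).

    docs  : list of token lists (already pre-processed)
    seeds : one set of seed words per topic; a seed word gets `boost`
            added to its topic-word prior for that topic (an assumption,
            not necessarily the scheme used in the paper)
    Returns (dominant topic per document, document-topic count table).
    """
    rng = random.Random(rng_seed)
    n_topics = len(seeds)
    vocab = sorted({w for doc in docs for w in doc})

    def prior(k, w):
        # Asymmetric topic-word prior: larger for seed words of topic k.
        return beta + (boost if w in seeds[k] else 0.0)

    prior_sum = [sum(prior(k, w) for w in vocab) for k in range(n_topics)]

    n_dk = [[0] * n_topics for _ in docs]        # document-topic counts
    n_kw = [Counter() for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                         # tokens per topic
    z = []                                       # topic assignment per token

    # Initialization: seed words start in one of their seeded topics.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            owners = [k for k in range(n_topics) if w in seeds[k]]
            k = rng.choice(owners) if owners else rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Resample the topic from its full conditional distribution.
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + prior(t, w))
                    / (n_k[t] + prior_sum[t])
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    labels = [max(range(n_topics), key=lambda t: row[t]) for row in n_dk]
    return labels, n_dk


# Hypothetical toy corpus: romanized placeholder tokens, not real dataset content.
docs = [
    ["cricket", "match", "team", "score"],
    ["election", "vote", "party", "minister"],
    ["team", "captain", "match", "minister"],
]
seeds = [{"cricket", "match"}, {"election", "vote"}]
labels, counts = seeded_lda(docs, seeds)
print(labels)
```

Each document is assigned to the topic that holds most of its tokens, which is one simple way to turn topic proportions into clusters. The third toy document mixes vocabulary from both categories, which mirrors the "dataset with overlapping" condition in which seed words are meant to steer the topics.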