Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling
Document clustering groups documents according to certain semantic features. Topic models have a rich semantic structure and considerable potential for helping users understand document corpora. Unfortunately, this potential is stymied on text documents which have overlapping nature, due to their...
Main Authors: | Mubashar Mustafa, Feng Zeng, Hussain Ghulam, Hafiz Muhammad Arslan |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2020-11-01 |
Series: | Information |
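Subjects: | topic modeling; document clustering; non-negative matrix factorization (NMF); Urdu language; latent dirichlet allocation (LDA) |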
Online Access: | https://www.mdpi.com/2078-2489/11/11/518 |
_version_ | 1827702850465038336 |
---|---|
author | Mubashar Mustafa; Feng Zeng; Hussain Ghulam; Hafiz Muhammad Arslan |
author_facet | Mubashar Mustafa; Feng Zeng; Hussain Ghulam; Hafiz Muhammad Arslan |
author_sort | Mubashar Mustafa |
collection | DOAJ |
description | Document clustering groups documents according to certain semantic features. Topic models have a rich semantic structure and considerable potential for helping users understand document corpora. Unfortunately, because they are purely unsupervised, this potential is stymied on text documents whose categories overlap. To solve this problem, some semi-supervised models have been proposed for the English language. However, no such work is available for Urdu, a low-resource language. Document clustering is therefore a challenging task in Urdu, which has its own morphology, syntax and semantics. In this study, we propose a semi-supervised framework for Urdu document clustering that deals with the challenges of Urdu morphology. The proposed model combines pre-processing techniques, the seeded-LDA model and Gibbs sampling; we name it seeded-Urdu Latent Dirichlet Allocation (seeded-ULDA). We apply the proposed model and other methods to Urdu news datasets for categorization. Two conditions are considered for document clustering: a “dataset without overlapping”, in which all classes are distinct, and a “dataset with overlapping”, in which the categories overlap and the classes are connected to each other. The aim of this study is threefold. First, it shows that unsupervised models (Latent Dirichlet Allocation (LDA), non-negative matrix factorization (NMF) and K-means) give satisfactory results on the dataset without overlapping. Second, it shows that these unsupervised models do not perform well on the dataset with overlapping because, on this dataset, they find some topics that are neither entirely meaningful nor effective in extrinsic tasks. Third, the proposed semi-supervised model, Seeded-ULDA, performs well on both datasets because it offers a straightforward and effective way to instruct topic models to find topics of specific interest. The paper shows that the semi-supervised model, Seeded-ULDA, achieves significant results compared with the unsupervised algorithms. |
first_indexed | 2024-03-10T15:05:20Z |
format | Article |
id | doaj.art-a9a930230e9748a384a9e46f89f899d5 |
institution | Directory Open Access Journal |
issn | 2078-2489 |
language | English |
last_indexed | 2024-03-10T15:05:20Z |
publishDate | 2020-11-01 |
publisher | MDPI AG |
record_format | Article |
series | Information |
spelling | doaj.art-a9a930230e9748a384a9e46f89f899d5; 2023-11-20T19:52:00Z; eng; MDPI AG; Information; 2078-2489; 2020-11-01; 11; 11; 518; 10.3390/info11110518; Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling; Mubashar Mustafa (0); Feng Zeng (1); Hussain Ghulam (2); Hafiz Muhammad Arslan (3); (0) School of Computer Science and Engineering, Central South University, 410083 Changsha, China; (1) School of Computer Science and Engineering, Central South University, 410083 Changsha, China; (2) School of Computer Science and Engineering, Central South University, 410083 Changsha, China; (3) School of Software Engineering, Northeastern University, 110819 Shenyang, China; https://www.mdpi.com/2078-2489/11/11/518; topic modeling; document clustering; non-negative matrix factorization (NMF); Urdu language; latent dirichlet allocation (LDA) |
spellingShingle | Mubashar Mustafa; Feng Zeng; Hussain Ghulam; Hafiz Muhammad Arslan; Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling; Information; topic modeling; document clustering; non-negative matrix factorization (NMF); Urdu language; latent dirichlet allocation (LDA) |
title | Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling |
title_full | Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling |
title_fullStr | Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling |
title_full_unstemmed | Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling |
title_short | Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling |
title_sort | urdu documents clustering with unsupervised and semi supervised probabilistic topic modeling |
topic | topic modeling; document clustering; non-negative matrix factorization (NMF); Urdu language; latent dirichlet allocation (LDA) |
url | https://www.mdpi.com/2078-2489/11/11/518 |
work_keys_str_mv | AT mubasharmustafa urdudocumentsclusteringwithunsupervisedandsemisupervisedprobabilistictopicmodeling AT fengzeng urdudocumentsclusteringwithunsupervisedandsemisupervisedprobabilistictopicmodeling AT hussainghulam urdudocumentsclusteringwithunsupervisedandsemisupervisedprobabilistictopicmodeling AT hafizmuhammadarslan urdudocumentsclusteringwithunsupervisedandsemisupervisedprobabilistictopicmodeling |
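
The abstract describes Seeded-ULDA as a combination of pre-processing, a seeded-LDA model and Gibbs sampling, but this record gives no implementation details. The following is only a minimal sketch of one common way to seed LDA: a collapsed Gibbs sampler in which each topic's seed words receive a larger topic-word prior. It is not the authors' method. The function name `seeded_lda`, the `boost` parameter, the hyperparameter values and the English toy tokens (stand-ins for pre-processed Urdu tokens) are all assumptions made for illustration.

```python
import random

def seeded_lda(docs, seed_words, n_iter=200, alpha=0.1, beta=0.01,
               boost=5.0, rng_seed=0):
    """Cluster tokenized docs with LDA whose priors are nudged by seed words.

    docs: list of token lists; seed_words: one set of seed tokens per topic.
    Returns the dominant topic index for each document.
    """
    rng = random.Random(rng_seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_topics = len(seed_words)
    # Asymmetric topic-word prior: beta for every word, plus `boost`
    # for the words listed as seeds of that topic (illustrative choice).
    prior = [{w: beta + (boost if w in seed_words[k] else 0.0) for w in vocab}
             for k in range(n_topics)]
    prior_sum = [sum(p.values()) for p in prior]

    n_dk = [[0] * n_topics for _ in docs]                   # doc-topic counts
    n_kw = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                                    # tokens per topic
    z = []                                                  # topic of each token
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)                     # random initial topic
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Collapsed Gibbs conditional for each candidate topic.
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + prior[t][w]) / (n_k[t] + prior_sum[t])
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    return [max(range(n_topics), key=lambda t: n_dk[d][t])
            for d in range(len(docs))]

# Toy corpus: hypothetical tokens standing in for pre-processed Urdu news text.
toy_docs = [
    ["cricket", "match", "team", "score"],
    ["team", "player", "match", "win"],
    ["election", "minister", "vote", "party"],
    ["party", "government", "minister", "policy"],
]
toy_seeds = [{"cricket", "match"}, {"election", "minister"}]
labels = seeded_lda(toy_docs, toy_seeds)
```

Here `labels[i]` is the index of the topic with the most token assignments in document `i`, which serves as that document's cluster. Setting `boost=0.0` reduces the sketch to plain unsupervised LDA with a symmetric prior, which gives a simple way to compare seeded and unseeded behavior on the same corpus.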