Leveraging Global and Local Topic Popularities for LDA-Based Document Clustering

Document clustering is of high importance for many natural language technologies. A wide range of computational traditional topic models, such as LDA (Latent Dirichlet Allocation) and its variants, have made great progress. However, traditional LDA-based clustering algorithms might not give good res...


Bibliographic Details
Main Authors: Peng Yang, Yu Yao, Huajian Zhou
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Subjects: Document clustering; latent Dirichlet allocation; machine learning; topic modeling
Online Access: https://ieeexplore.ieee.org/document/8970318/
_version_ 1818664557359398912
author Peng Yang
Yu Yao
Huajian Zhou
collection DOAJ
description Document clustering is of high importance for many natural language technologies. A wide range of computational traditional topic models, such as LDA (Latent Dirichlet Allocation) and its variants, have made great progress. However, traditional LDA-based clustering algorithms might not give good results because such probabilistic models require prior distributions, which are always difficult to define. In this paper, we propose a probabilistic model named tpLDA, which incorporates different levels of topic popularity information to determine the LDA prior distribution, discover the latent topics and achieve better clustering. Specifically, global topic popularity is introduced to reduce the potential distraction in local cluster popularity, and the local cluster popularity draws more attention to certain parts of the global topic popularity. The two popularities contribute complementary information, and their integration can dynamically adjust the statistical parameters of the model. Experimental evaluations on real data sets show that, compared with state-of-the-art approaches, our proposed framework dramatically improves the accuracy of document clustering.
first_indexed 2024-12-17T05:34:38Z
format Article
id doaj.art-9f3c3223e5204868a5b830f2f0d9aed6
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-17T05:34:38Z
publishDate 2020-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-9f3c3223e5204868a5b830f2f0d9aed6
2022-12-21T22:01:39Z
eng
IEEE
IEEE Access
2169-3536
2020-01-01
Volume 8, pages 24734-24745
10.1109/ACCESS.2020.2969525
8970318
Leveraging Global and Local Topic Popularities for LDA-Based Document Clustering
Peng Yang (https://orcid.org/0000-0002-1184-8117), Yu Yao, Huajian Zhou
Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, Nanjing, China
title Leveraging Global and Local Topic Popularities for LDA-Based Document Clustering
topic Document clustering
latent Dirichlet allocation
machine learning
topic modeling
url https://ieeexplore.ieee.org/document/8970318/