Leveraging Global and Local Topic Popularities for LDA-Based Document Clustering
Document clustering is of high importance for many natural language technologies. A wide range of computational traditional topic models, such as LDA (Latent Dirichlet Allocation) and its variants, have made great progress. However, traditional LDA-based clustering algorithms might not give good res...
Main Authors: | Peng Yang, Yu Yao, Huajian Zhou |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2020-01-01 |
Series: | IEEE Access |
Subjects: | Document clustering; latent Dirichlet allocation; machine learning; topic modeling |
Online Access: | https://ieeexplore.ieee.org/document/8970318/ |
_version_ | 1818664557359398912 |
---|---|
author | Peng Yang; Yu Yao; Huajian Zhou |
author_sort | Peng Yang |
collection | DOAJ |
description | Document clustering is of high importance for many natural language technologies. A wide range of computational traditional topic models, such as LDA (Latent Dirichlet Allocation) and its variants, have made great progress. However, traditional LDA-based clustering algorithms might not give good results because such probabilistic models require prior distributions, which are always difficult to define. In this paper, we propose a probabilistic model named tpLDA, which incorporates different levels of topic popularity information to determine the prior LDA distribution, discover the latent topics and achieve better clustering. Specifically, global topic popularity is introduced to reduce the potential distraction in local cluster popularity, and the local cluster popularity draws more attention to certain parts of the global topic popularity. The two popularities contribute complementary information and their integration can dynamically adjust statistical parameters of the model. Experimental evaluations on real data sets show that, compared with state-of-the-art approaches, our proposed framework dramatically improves the accuracy of document clustering. |
first_indexed | 2024-12-17T05:34:38Z |
format | Article |
id | doaj.art-9f3c3223e5204868a5b830f2f0d9aed6 |
institution | Directory of Open Access Journals |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-17T05:34:38Z |
publishDate | 2020-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-9f3c3223e5204868a5b830f2f0d9aed6; 2022-12-21T22:01:39Z; eng; IEEE; IEEE Access; 2169-3536; 2020-01-01; 8; 24734; 24745; 10.1109/ACCESS.2020.2969525; 8970318; Leveraging Global and Local Topic Popularities for LDA-Based Document Clustering; Peng Yang 0 https://orcid.org/0000-0002-1184-8117; Yu Yao 1; Huajian Zhou 2; Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, Nanjing, China; Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, Nanjing, China; Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, Nanjing, China; Document clustering is of high importance for many natural language technologies. A wide range of computational traditional topic models, such as LDA (Latent Dirichlet Allocation) and its variants, have made great progress. However, traditional LDA-based clustering algorithms might not give good results because such probabilistic models require prior distributions, which are always difficult to define. In this paper, we propose a probabilistic model named tpLDA, which incorporates different levels of topic popularity information to determine the prior LDA distribution, discover the latent topics and achieve better clustering. Specifically, global topic popularity is introduced to reduce the potential distraction in local cluster popularity, and the local cluster popularity draws more attention to certain parts of the global topic popularity. The two popularities contribute complementary information and their integration can dynamically adjust statistical parameters of the model. Experimental evaluations on real data sets show that, compared with state-of-the-art approaches, our proposed framework dramatically improves the accuracy of document clustering.; https://ieeexplore.ieee.org/document/8970318/; Document clustering; latent Dirichlet allocation; machine learning; topic modeling |
title | Leveraging Global and Local Topic Popularities for LDA-Based Document Clustering |
topic | Document clustering; latent Dirichlet allocation; machine learning; topic modeling |
url | https://ieeexplore.ieee.org/document/8970318/ |