PGLDA: enhancing the precision of topic modelling using Poisson Gamma (PG) and Latent Dirichlet Allocation (LDA) for text information retrieval
The Poisson document length distribution has been used extensively in the past for modeling topics with the expectation that its effect will disintegrate at the end of the model definition. This procedure often leads to downplaying word correlation with topics and reducing retrieved documents...
Main Author: | Bakari, Ibrahim Bala |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2021 |
Subjects: | QA76 Computer software T Technology (General) |
Online Access: | http://eprints.uthm.edu.my/4890/1/24p%20IBRAHIM%20BALA%20BAKARI.pdf http://eprints.uthm.edu.my/4890/2/IBRAHIM%20BALA%20BAKARI%20COPYRIGHT%20DECLARATION.pdf http://eprints.uthm.edu.my/4890/3/IBRAHIM%20BALA%20BAKARI%20WATERMARK.pdf |
_version_ | 1796869103323971584 |
---|---|
author | Bakari, Ibrahim Bala |
collection | UTHM |
description | The Poisson document length distribution has been used extensively in the past for
modeling topics with the expectation that its effect will disintegrate at the end of the
model definition. This procedure often leads to downplaying word correlation with
topics and reducing retrieved documents' precision or accuracy. The existing
document model, such as the Latent Dirichlet Allocation (LDA) model, does not
accommodate words' semantic representation. Therefore, in this thesis, the Poisson-Gamma
Latent Dirichlet Allocation (PGLDA) model for modeling word
dependencies in topic modeling is introduced. The PGLDA model relaxes the
word-independence assumption in the existing Latent Dirichlet Allocation (LDA) model
by introducing the Gamma distribution that captures the correlation between adjacent
words in documents. The PGLDA is hybridized with the distributed representation of
documents (Doc2Vec) and topics (Topic2Vec) to form a new model named
PGLDA2Vec. The hybridization process was achieved by averaging the Doc2Vec
and Topic2Vec vectors to form new word representation vectors, combined with
topics with the largest estimated probability using PGLDA. Model estimations for
PGLDA and PGLDA2Vec models were achieved by combining the Laplacian
approximation of log-likelihood for PGLDA and Feed-Forward Neural Network
(FFN) approaches of Doc2Vec and Topic2Vec. The proposed PGLDA and the
hybrid PGLDA2Vec models were assessed using precision, micro F1 scores,
perplexity, and coherence score. The empirical analysis results using three real-world
datasets (20 Newsgroups, AG News, and Reuters) showed that the hybrid
PGLDA2Vec model, with an average precision of 86.6% and an average F1 score of
96.3% across the three datasets, outperformed the other competing models reviewed. |
first_indexed | 2024-03-05T21:49:49Z |
format | Thesis |
id | uthm.eprints-4890 |
institution | Universiti Tun Hussein Onn Malaysia |
language | English |
last_indexed | 2024-03-05T21:49:49Z |
publishDate | 2021 |
record_format | dspace |
spelling | uthm.eprints-4890 2022-02-03T03:08:46Z http://eprints.uthm.edu.my/4890/ PGLDA: enhancing the precision of topic modelling using Poisson Gamma (PG) and Latent Dirichlet Allocation (LDA) for text information retrieval Bakari, Ibrahim Bala QA76 Computer software T Technology (General) 2021-07 Thesis NonPeerReviewed text en http://eprints.uthm.edu.my/4890/1/24p%20IBRAHIM%20BALA%20BAKARI.pdf text en http://eprints.uthm.edu.my/4890/2/IBRAHIM%20BALA%20BAKARI%20COPYRIGHT%20DECLARATION.pdf text en http://eprints.uthm.edu.my/4890/3/IBRAHIM%20BALA%20BAKARI%20WATERMARK.pdf Bakari, Ibrahim Bala (2021) PGLDA: enhancing the precision of topic modelling using Poisson Gamma (PG) and Latent Dirichlet Allocation (LDA) for text information retrieval. Doctoral thesis, Universiti Tun Hussein Onn Malaysia. |
title | PGLDA: enhancing the precision of topic modelling using Poisson Gamma (PG) and Latent Dirichlet Allocation (LDA) for text information retrieval |
topic | QA76 Computer software T Technology (General) |
url | http://eprints.uthm.edu.my/4890/1/24p%20IBRAHIM%20BALA%20BAKARI.pdf http://eprints.uthm.edu.my/4890/2/IBRAHIM%20BALA%20BAKARI%20COPYRIGHT%20DECLARATION.pdf http://eprints.uthm.edu.my/4890/3/IBRAHIM%20BALA%20BAKARI%20WATERMARK.pdf |
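
The description above mentions averaging the Doc2Vec and Topic2Vec vectors, selecting the topic with the largest estimated probability, and scoring with precision and micro F1. The sketch below illustrates those three operations in plain Python. It is not the thesis's implementation: the function names, the toy vectors, and the assumption that the vectors share one dimensionality are all hypothetical, and the record does not say how the averaged vector and the selected topic are combined afterwards.

```python
from typing import List, Sequence, Tuple


def average_vectors(doc_vec: Sequence[float], topic_vec: Sequence[float]) -> List[float]:
    """Element-wise mean of two equal-length embedding vectors (assumed layout)."""
    if len(doc_vec) != len(topic_vec):
        raise ValueError("vectors must have the same dimensionality")
    return [(d + t) / 2.0 for d, t in zip(doc_vec, topic_vec)]


def most_probable_topic(topic_probs: Sequence[float]) -> int:
    """Index of the topic with the largest estimated probability."""
    return max(range(len(topic_probs)), key=lambda k: topic_probs[k])


def micro_precision_f1(
    y_true: Sequence[int], y_pred: Sequence[int], labels: Sequence[int]
) -> Tuple[float, float]:
    """Micro-averaged precision and F1: pool TP/FP/FN over all labels, then score."""
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, f1


# Toy values, made up for illustration only.
hybrid = average_vectors([0.25, 0.5], [0.75, 0.0])
best_topic = most_probable_topic([0.1, 0.7, 0.2])
precision, f1 = micro_precision_f1([0, 1, 2, 1], [0, 2, 2, 1], labels=[0, 1, 2])
```

In single-label classification, pooling the counts this way makes micro precision, recall, and F1 coincide with plain accuracy, so a reported gap between precision and micro F1 would imply a different averaging or labelling setup than this sketch assumes.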