Combination of Bayesian and Latent Semantic Analysis with Domain Specific Knowledge

Bibliographic Details
Main Authors: Shen Lu, Richard S. Segall
Format: Article
Language: English
Published: International Institute of Informatics and Cybernetics 2016-06-01
Series: Journal of Systemics, Cybernetics and Informatics
Online Access: http://www.iiisci.org/Journal/CV$/sci/pdfs/SA330TP16.pdf
Description
Summary: With the development of information technology, electronic publications have become popular. However, retrieving information from them is a challenge because of the large number of words, the synonymy problem and the polysemy problem. In this paper, we introduced a new algorithm called Bayesian Latent Semantic Analysis (BLSA). We chose to model text based not on terms but on associations between words. The significance of the features of interest was also improved by expanding the number of similar terms with glossaries. Latent Semantic Analysis (LSA) was chosen to discover significant features. Bayesian posterior probability was used to discover segmentation boundaries. A Dirichlet distribution was chosen to represent the vector of topic distributions and to calculate the maximum probability of the topics. Experimental results showed that both Pk [8] and WindowDiff [27] decreased by 10% with BLSA compared with Lexical Cohesion applied to the original data.

[8] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R. (1990). Indexing by latent semantic analysis, Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407.
[27] Pevzner, L. and Hearst, M.A. (2002). A critique and improvement of an evaluation metric for text segmentation, Computational Linguistics, vol. 28, no. 1, pp. 19-36.
ISSN: 1690-4524
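
The summary reports results in terms of Pk and WindowDiff, two error measures for text segmentation where lower is better. As a rough illustration of what those numbers measure, the sketch below follows the commonly used sliding-window definitions (WindowDiff as described by Pevzner and Hearst [27]). It is not the authors' evaluation code. The function names, the representation of a segmentation as a list of segment lengths, and the default window size (half the mean reference segment length) are assumptions made for this example.

```python
def labels_from_masses(masses):
    """Expand segment lengths, e.g. [2, 3], into per-unit segment ids [0, 0, 1, 1, 1]."""
    return [seg for seg, length in enumerate(masses) for _ in range(length)]


def _prepare(ref_masses, hyp_masses, k):
    """Validate inputs and pick a window size (assumed default: half the mean reference segment)."""
    ref = labels_from_masses(ref_masses)
    hyp = labels_from_masses(hyp_masses)
    if len(ref) != len(hyp):
        raise ValueError("reference and hypothesis must cover the same number of units")
    if k is None:
        k = max(1, round(len(ref) / len(ref_masses) / 2))
    if k >= len(ref):
        raise ValueError("window size must be smaller than the document length")
    return ref, hyp, k


def pk(ref_masses, hyp_masses, k=None):
    """Fraction of windows where reference and hypothesis disagree on
    whether units i and i + k belong to the same segment."""
    ref, hyp, k = _prepare(ref_masses, hyp_masses, k)
    windows = len(ref) - k
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k]) for i in range(windows)
    )
    return errors / windows


def window_diff(ref_masses, hyp_masses, k=None):
    """Fraction of windows where the number of boundaries between units
    i and i + k differs between reference and hypothesis."""
    ref, hyp, k = _prepare(ref_masses, hyp_masses, k)
    windows = len(ref) - k
    # Segment ids are consecutive, so their difference counts the boundaries crossed.
    errors = sum(
        (ref[i + k] - ref[i]) != (hyp[i + k] - hyp[i]) for i in range(windows)
    )
    return errors / windows


if __name__ == "__main__":
    reference = [5, 5]      # two segments of five units each
    hypothesis = [4, 1, 5]  # one near miss plus the correct boundary
    print(pk(reference, hypothesis, k=2))           # 0.125
    print(window_diff(reference, hypothesis, k=2))  # 0.25
```

In this toy case the hypothesis places one spurious boundary right next to the true one. Pk only asks whether two units fall in the same segment, so it misses the window that contains both boundaries, while WindowDiff counts boundaries and penalizes that window as well.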