An evolutionary-based term reduction approach to bilingual clustering of Malay-English corpora

Bibliographic Details
Main Authors: Rayner Alfred, Leow Ching Leong, Joe Henry Obit
Format: Conference or Workshop Item
Language: English
Published: Springer International Publishing 2017
Subjects:
Online Access: https://eprints.ums.edu.my/id/eprint/29090/1/An%20Evolutionary-Based%20Term%20Reduction%20Approach%20to%20Bilingual%20Clustering%20of%20Malay-English%20Corpora%20ABSTRACT.pdf
https://eprints.ums.edu.my/id/eprint/29090/2/An%20Evolutionary-Based%20Term%20Reduction%20Approach%20to%20Bilingual%20Clustering%20of%20Malay-English%20Corpora.pdf
Description
Summary: The document clustering process groups unstructured text documents into a predefined set of clusters in order to provide more information to users. Many studies have been conducted on clustering monolingual documents. With advances in current technology, the study of bilingual clustering has become feasible. However, clustering bilingual documents still faces the same problem as monolingual document clustering, namely the “curse of dimensionality”. This motivates the study of term reduction techniques for clustering bilingual documents. The objective of this study is to examine the effects of reducing the number of terms considered when clustering a parallel bilingual corpus of English and Malay documents. A genetic algorithm (GA) is used to reduce the number of features selected, with a single-point crossover and a crossover rate of 0.8. The study also assesses the effects of different mutation rates (e.g., 0.1 and 0.01) on selecting the features used in clustering bilingual documents. The results show that applying the GA improves the clustering mapping compared to the initial clustering mapping. The study also finds that the GA with a mutation rate of 0.01 produces better parallel clustering mapping results than the GA with a mutation rate of 0.1.
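The GA-based term reduction described in the summary can be illustrated with a minimal Python sketch. Only the single-point crossover, the crossover rate of 0.8, and the mutation rates of 0.1 and 0.01 come from the summary. Everything else is an assumption made for illustration: the binary term-mask encoding, the tournament selection, the elitism, the population size, the number of generations, and in particular `toy_fitness`, a hypothetical stand-in for the study's clustering-mapping measure, which the summary does not specify.

```python
import random


def single_point_crossover(a, b, rng, rate=0.8):
    """With probability `rate`, swap the tails of two parents at one cut point."""
    if len(a) > 1 and rng.random() < rate:
        cut = rng.randrange(1, len(a))
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return a[:], b[:]


def mutate(mask, rng, rate):
    """Flip each bit (keep/drop a term) independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in mask]


def tournament(pop, scores, rng, k=2):
    """Return the fittest of k randomly drawn individuals (assumed scheme)."""
    contenders = rng.sample(range(len(pop)), k)
    return pop[max(contenders, key=lambda i: scores[i])]


def run_ga(n_terms, fitness, mutation_rate, crossover_rate=0.8,
           pop_size=20, generations=30, seed=0):
    """Evolve a 0/1 mask over terms; 1 means the term is kept for clustering."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_terms)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        scores = [fitness(mask) for mask in pop]
        nxt = [best[:]]  # elitism: carry the best mask forward (assumption)
        while len(nxt) < pop_size:
            p1 = tournament(pop, scores, rng)
            p2 = tournament(pop, scores, rng)
            c1, c2 = single_point_crossover(p1, p2, rng, crossover_rate)
            nxt.append(mutate(c1, rng, mutation_rate))
            if len(nxt) < pop_size:
                nxt.append(mutate(c2, rng, mutation_rate))
        pop = nxt
        best = max(pop + [best], key=fitness)
    return best


def toy_fitness(mask):
    """Hypothetical score: reward the first 5 terms, penalise every kept term."""
    return sum(mask[:5]) - 0.1 * sum(mask)


if __name__ == "__main__":
    for rate in (0.1, 0.01):  # the two mutation rates named in the summary
        mask = run_ga(n_terms=30, fitness=toy_fitness, mutation_rate=rate)
        print(rate, sum(mask), round(toy_fitness(mask), 2))
```

In a real setting, `toy_fitness` would be replaced by a function that clusters the English and Malay documents using only the kept terms and scores how well the two clusterings map onto each other. The toy function says nothing about which mutation rate performs better on actual corpora.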