An evolutionary-based term reduction approach to bilingual clustering of Malay-English corpora
Main Authors: | , , |
---|---|
Format: | Conference or Workshop Item |
Language: | English |
Published: | Springer International Publishing, 2017 |
Subjects: | |
Online Access: | https://eprints.ums.edu.my/id/eprint/29090/1/An%20Evolutionary-Based%20Term%20Reduction%20Approach%20to%20Bilingual%20Clustering%20of%20Malay-English%20Corpora%20ABSTRACT.pdf https://eprints.ums.edu.my/id/eprint/29090/2/An%20Evolutionary-Based%20Term%20Reduction%20Approach%20to%20Bilingual%20Clustering%20of%20Malay-English%20Corpora.pdf |
Summary: | Document clustering groups unstructured text documents into a predefined set of clusters in order to provide more information to users. Many studies have addressed the clustering of monolingual documents, and advances in current technology have made the study of bilingual clustering feasible as well. However, bilingual document clustering still faces the same problem as monolingual document clustering: the “curse of dimensionality”. This motivates the study of term reduction techniques for clustering bilingual documents. The objective of this study is to examine the effects of reducing the terms considered when clustering a bilingual corpus in parallel for English and Malay documents. A genetic algorithm (GA) is used to reduce the number of features selected, with single-point crossover at a crossover rate of 0.8. The study also assesses the effects of different mutation rates (e.g., 0.1 and 0.01) on the number of features selected for clustering bilingual documents. The results show that the GA improves the clustering mapping compared with the initial clustering mapping, and that a GA with a mutation rate of 0.01 produces better parallel clustering mapping results than a GA with a mutation rate of 0.1. |
---|
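
The GA-based term reduction described in the summary can be illustrated with a minimal sketch. Only the single-point crossover, the crossover rate of 0.8, and the mutation rates of 0.01 and 0.1 come from the record. Everything else is an assumption made for illustration: the binary-mask encoding of terms, binary tournament selection, elitism, the population size and generation count, and above all `toy_fitness`, a hypothetical stand-in for the paper's clustering-mapping evaluation, which the record does not specify.

```python
import random


def single_point_crossover(a, b, rng, rate=0.8):
    """With probability `rate`, swap the tails of two parents at one random cut."""
    if len(a) > 1 and rng.random() < rate:
        cut = rng.randrange(1, len(a))  # cut point strictly inside the mask
        return a[:cut] + b[cut:], b[:cut] + a[cut:]
    return a[:], b[:]  # no crossover: children are copies of the parents


def mutate(mask, rng, rate=0.01):
    """Flip each bit (term kept / term dropped) independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in mask]


def tournament(pop, fitness, rng):
    """Binary tournament selection (an assumption; the record names no selection scheme)."""
    x, y = rng.sample(pop, 2)
    return x if fitness(x) >= fitness(y) else y


def evolve(fitness, n_terms, rng, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.01):
    """Search for a 0/1 mask over `n_terms` terms that maximises `fitness`."""
    pop = [[rng.randint(0, 1) for _ in range(n_terms)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = [best[:]]  # elitism (an assumption): carry the best mask forward
        while len(nxt) < pop_size:
            c1, c2 = single_point_crossover(
                tournament(pop, fitness, rng),
                tournament(pop, fitness, rng),
                rng, crossover_rate)
            nxt.append(mutate(c1, rng, mutation_rate))
            if len(nxt) < pop_size:
                nxt.append(mutate(c2, rng, mutation_rate))
        pop = nxt
        best = max(pop, key=fitness)
    return best


# Hypothetical placeholder fitness: reward keeping a few "informative" term
# indices and penalise every selected term, mimicking a preference for small
# term sets. The actual study scores masks by clustering-mapping quality.
INFORMATIVE = {0, 3, 5}


def toy_fitness(mask):
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)


# Compare the two mutation rates mentioned in the summary on the toy problem.
best_low = evolve(toy_fitness, n_terms=10, rng=random.Random(0), mutation_rate=0.01)
best_high = evolve(toy_fitness, n_terms=10, rng=random.Random(0), mutation_rate=0.1)
```

In a real setting, `toy_fitness` would be replaced by a function that clusters the English and Malay documents using only the selected terms and scores how well the two clusterings map onto each other. The toy comparison above says nothing about which mutation rate performs better on real corpora.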