Development of a genetic-based hierarchical agglomerative clustering technique for parallel clustering of bilingual corpora based on reduced terms

In this project, we report on our work applying Hierarchical Agglomerative Clustering (HAC) to a large corpus of documents, each of which appears in both Malay and English. We cluster the documents separately for each language and compare the results with respect to the content of the clusters produced. On t...

Bibliographic Details
Main Authors: Rayner Alfred, Jason Teo, Chung, Seng Kheau
Format: Research Report
Language: English
Published: Universiti Malaysia Sabah 2010
Subjects: QA Mathematics
Online Access: https://eprints.ums.edu.my/id/eprint/24737/1/Development%20of%20genetic-based%20hierarchical.pdf
author Rayner Alfred
Jason Teo
Chung, Seng Kheau
collection UMS
description In this project, we report on our work applying Hierarchical Agglomerative Clustering (HAC) to a large corpus of documents, each of which appears in both Malay and English. We cluster the documents separately for each language and compare the results with respect to the content of the clusters produced. On the data available, the results of clustering in one language resemble those in the other, provided the number of clusters required is relatively small. Further, we study the effect of changing the method used to compute the inter-cluster distance, considering single-link, complete-link and average-link distances between clusters. Finally, we describe an experiment employing a genetic algorithm to fine-tune the individual term weights in order to reproduce a predefined set of clusters more closely.
first_indexed 2024-03-06T03:02:24Z
format Research Report
id ums.eprints-24737
institution Universiti Malaysia Sabah
language English
last_indexed 2024-03-06T03:02:24Z
publishDate 2010
publisher Universiti Malaysia Sabah
record_format dspace
spelling ums.eprints-247372020-01-29T02:48:12Z https://eprints.ums.edu.my/id/eprint/24737/ Development of a genetic-based hierarchical agglomerative clustering technique for parallel clustering of bilingual corpora based on reduced terms Rayner Alfred Jason Teo Chung, Seng Kheau QA Mathematics In this project, we report on our work applying Hierarchical Agglomerative Clustering (HAC) to a large corpus of documents, each of which appears in both Malay and English. We cluster the documents separately for each language and compare the results with respect to the content of the clusters produced. On the data available, the results of clustering in one language resemble those in the other, provided the number of clusters required is relatively small. Further, we study the effect of changing the method used to compute the inter-cluster distance, considering single-link, complete-link and average-link distances between clusters. Finally, we describe an experiment employing a genetic algorithm to fine-tune the individual term weights in order to reproduce a predefined set of clusters more closely. Universiti Malaysia Sabah 2010 Research Report NonPeerReviewed text en https://eprints.ums.edu.my/id/eprint/24737/1/Development%20of%20genetic-based%20hierarchical.pdf Rayner Alfred and Jason Teo and Chung, Seng Kheau (2010) Development of a genetic-based hierarchical agglomerative clustering technique for parallel clustering of bilingual corpora based on reduced terms. (Unpublished)
title Development of a genetic-based hierarchical agglomerative clustering technique for parallel clustering of bilingual corpora based on reduced terms
topic QA Mathematics
url https://eprints.ums.edu.my/id/eprint/24737/1/Development%20of%20genetic-based%20hierarchical.pdf