Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints
In similarity-based constrained clustering, various approaches have been proposed for defining the similarity between documents so that similar documents are grouped together. This paper presents an approach that uses term-distribution statistics extracted from a small number of cue instances wit...
Main Authors: | Uraiwan Buatoom, Waree Kongprawechnon, Thanaruk Theeramunkong |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2020-06-01 |
Series: | Symmetry |
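Subjects: | constrained document clustering; term weighting; class frequencies and distributions; text mining; similarity measure; k-means |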
Online Access: | https://www.mdpi.com/2073-8994/12/6/967 |
_version_ | 1797565925048713216 |
---|---|
author | Uraiwan Buatoom; Waree Kongprawechnon; Thanaruk Theeramunkong |
author_facet | Uraiwan Buatoom; Waree Kongprawechnon; Thanaruk Theeramunkong |
author_sort | Uraiwan Buatoom |
collection | DOAJ |
description | In similarity-based constrained clustering, various approaches have been proposed for defining the similarity between documents so that similar documents are grouped together. This paper presents an approach that uses term-distribution statistics, extracted from a small number of cue instances with known classes, as term weightings that act as an indirect distance constraint. For distribution-based term weighting, three types of term-oriented standard deviations are exploited: distribution of a term in a collection (SD), average distribution of a term in a class (ACSD), and average distribution of a term among classes (CSD). These term weightings are explored with symmetry in mind: the magnitude of each of the three standard deviations is varied over positive and negative values to produce promoting and demoting effects. Following the same symmetry concept, both seeded and unseeded centroid initializations of k-means are investigated and compared to centroid-based classification. The experiments use five English text collections (Amazon, DI, WebKB1, WebKB2, and 20Newsgroup) and one Thai text collection (TR, a collection of Thai reform-related opinions). Compared to conventional TFIDF, the distribution-based term weighting improves the centroid-based method, seeded k-means, and k-means with error reduction rates of 22.45%, 31.13%, and 58.96%, respectively. |
first_indexed | 2024-03-10T19:19:48Z |
format | Article |
id | doaj.art-5631b2d183784499b84481d7549a7985 |
institution | Directory of Open Access Journals |
issn | 2073-8994 |
language | English |
last_indexed | 2024-03-10T19:19:48Z |
publishDate | 2020-06-01 |
publisher | MDPI AG |
record_format | Article |
series | Symmetry |
spelling | doaj.art-5631b2d183784499b84481d7549a7985; 2023-11-20T03:02:16Z; eng; MDPI AG; Symmetry; ISSN 2073-8994; 2020-06-01; 12(6), 967; DOI 10.3390/sym12060967; Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints; Uraiwan Buatoom, Waree Kongprawechnon, Thanaruk Theeramunkong (all: School of Information, Computer and Communication Technology (ICT), Sirindhorn International Institute of Technology, Thammasat University, 131 Moo 5, Tiwanon Road, Bangkadi, Pathumthani 12000, Thailand); https://www.mdpi.com/2073-8994/12/6/967; constrained document clustering; term weighting; class frequencies and distributions; text mining; similarity measure; k-means |
spellingShingle | Uraiwan Buatoom; Waree Kongprawechnon; Thanaruk Theeramunkong; Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints; Symmetry; constrained document clustering; term weighting; class frequencies and distributions; text mining; similarity measure; k-means |
title | Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints |
title_sort | document clustering using k means with term weighting as similarity based constraints |
topic | constrained document clustering; term weighting; class frequencies and distributions; text mining; similarity measure; k-means |
url | https://www.mdpi.com/2073-8994/12/6/967 |
work_keys_str_mv | AT uraiwanbuatoom documentclusteringusingkmeanswithtermweightingassimilaritybasedconstraints AT wareekongprawechnon documentclusteringusingkmeanswithtermweightingassimilaritybasedconstraints AT thanaruktheeramunkong documentclusteringusingkmeanswithtermweightingassimilaritybasedconstraints |
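
The description field names three term-level standard deviations (SD, ACSD, CSD) computed from a few labeled cue instances and used to weight terms before seeded k-means. The record gives no formulas, so the following is only a minimal sketch under stated assumptions: each statistic is read as a population standard deviation of raw term frequencies, the weight is taken to be a product of powers (a positive exponent promotes, a negative one demotes), and distances are Euclidean. The function names, the exponents `a`, `b`, `c`, the smoothing constant `eps`, the default weight of 1.0 for unseen terms, and the toy data are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev

def distribution_stats(docs, labels):
    """Return {term: (SD, ACSD, CSD)} from labeled cue documents.
    ASSUMPTION: population std of raw term frequencies; the paper may differ."""
    classes = sorted(set(labels))
    stats = {}
    for term in sorted({t for d in docs for t in d}):
        freqs = [d.get(term, 0.0) for d in docs]
        groups = [[f for f, y in zip(freqs, labels) if y == c] for c in classes]
        sd = pstdev(freqs)                          # spread over the collection
        acsd = mean([pstdev(g) for g in groups])    # mean within-class spread
        csd = pstdev([mean(g) for g in groups])     # spread of the class means
        stats[term] = (sd, acsd, csd)
    return stats

def term_weights(stats, a, b, c, eps=1e-9):
    """ASSUMPTION: weight = SD^a * ACSD^b * CSD^c; eps avoids 0 ** negative."""
    return {t: (sd + eps) ** a * (acsd + eps) ** b * (csd + eps) ** c
            for t, (sd, acsd, csd) in stats.items()}

def apply_weights(doc, weights):
    # Terms never seen in the cue documents keep weight 1.0 (an assumption).
    return {t: f * weights.get(t, 1.0) for t, f in doc.items()}

def centroid(vectors):
    terms = {t for v in vectors for t in v}
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in terms}

def sq_dist(u, v):
    return sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in set(u) | set(v))

def seeded_kmeans(vectors, seeds, iters=10):
    """k-means whose initial centroids are supplied (seeded) by the caller."""
    centroids = list(seeds)
    assign = []
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every vector.
        assign = [min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
                  for v in vectors]
        # Update step: recompute each non-empty cluster's centroid.
        for k in range(len(centroids)):
            members = [v for v, j in zip(vectors, assign) if j == k]
            if members:
                centroids[k] = centroid(members)
    return assign

# Toy cue instances with known classes (hypothetical data).
cue_docs = [{"goal": 3, "the": 5}, {"goal": 2, "the": 4},
            {"vote": 3, "the": 5}, {"vote": 2, "the": 4}]
cue_labels = ["sport", "sport", "politics", "politics"]

# Hypothetical exponents: ignore SD, demote ACSD, promote CSD.
weights = term_weights(distribution_stats(cue_docs, cue_labels), a=0, b=-1, c=1)

# Seed one centroid per class from the weighted cue documents.
class_names = sorted(set(cue_labels))
seeds = [centroid([apply_weights(d, weights)
                   for d, y in zip(cue_docs, cue_labels) if y == c])
         for c in class_names]

unlabeled = [{"goal": 4, "the": 6}, {"vote": 4, "the": 3},
             {"goal": 1, "the": 5}, {"vote": 1, "the": 6}]
vectors = [apply_weights(d, weights) for d in unlabeled]
clusters = [class_names[k] for k in seeded_kmeans(vectors, seeds)]
print(clusters)
```

In this toy setup the term "the" has the same mean frequency in both classes, so its CSD is zero and it is almost entirely demoted, while "goal" and "vote" dominate the distance computation. Unseeded k-means would differ only in how `seeds` is chosen, for example by picking initial centroids at random.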