Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints

In similarity-based constrained clustering, various approaches have been proposed for defining the similarity between documents so that similar documents are grouped together. This paper presents an approach to use term-distribution statistics extracted from a small number of cue instances wit...

Bibliographic Details
Main Authors: Uraiwan Buatoom, Waree Kongprawechnon, Thanaruk Theeramunkong
Format: Article
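Language: English
Published: MDPI AG 2020-06-01
Series: Symmetry
Subjects: constrained document clustering; term weighting; class frequencies and distributions; text mining; similarity measure; k-means
Online Access: https://www.mdpi.com/2073-8994/12/6/967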
author Uraiwan Buatoom
Waree Kongprawechnon
Thanaruk Theeramunkong
collection DOAJ
description In similarity-based constrained clustering, various approaches have been proposed for defining the similarity between documents so that similar documents are grouped together. This paper presents an approach that uses term-distribution statistics, extracted from a small number of cue instances with known classes, as term weightings that act as an indirect distance constraint. For distribution-based term weighting, three types of term-oriented standard deviations are exploited: the distribution of a term in a collection (SD), the average distribution of a term in a class (ACSD), and the average distribution of a term among classes (CSD). These term weightings are explored under the symmetry concept by varying the magnitude over positive and negative values to promote or demote the effect of each of the three standard deviations. Following the symmetry concept, both seeded and unseeded centroid initializations of k-means are investigated and compared with centroid-based classification. The experiments use five English text collections (Amazon, DI, WebKB1, WebKB2, and 20Newsgroup) and one Thai text collection (TR, a collection of Thai reform-related opinions). Compared with conventional TFIDF, the distribution-based term weighting improves the centroid-based method, seeded k-means, and k-means, with error reduction rates of 22.45%, 31.13%, and 58.96%, respectively.
first_indexed 2024-03-10T19:19:48Z
format Article
id doaj.art-5631b2d183784499b84481d7549a7985
institution Directory Open Access Journal
issn 2073-8994
language English
last_indexed 2024-03-10T19:19:48Z
publishDate 2020-06-01
publisher MDPI AG
record_format Article
series Symmetry
spelling doaj.art-5631b2d183784499b84481d7549a7985
2023-11-20T03:02:16Z
eng
MDPI AG
Symmetry
2073-8994
2020-06-01
12, 6, 967
10.3390/sym12060967
Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints
Uraiwan Buatoom, Waree Kongprawechnon, Thanaruk Theeramunkong
School of Information, Computer and Communication Technology (ICT), Sirindhorn International Institute of Technology, Thammasat University, 131 Moo 5, Tiwanon Road, Bangkadi, Pathumthani 12000, Thailand (affiliation of all three authors)
https://www.mdpi.com/2073-8994/12/6/967
constrained document clustering; term weighting; class frequencies and distributions; text mining; similarity measure; k-means
title Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints
topic constrained document clustering
term weighting
class frequencies and distributions
text mining
similarity measure
k-means
url https://www.mdpi.com/2073-8994/12/6/967