Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints

In similarity-based constrained clustering, various approaches have been proposed for defining the similarity between documents, which guides the grouping of similar documents. This paper presents an approach that uses term-distribution statistics, extracted from a small number of cue instances wit...


Bibliographic Details
Main Authors:	Uraiwan Buatoom, Waree Kongprawechnon, Thanaruk Theeramunkong
Format:	Article
Language:	English
Published:	MDPI AG 2020-06-01
Series:	Symmetry
Subjects:	constrained document clustering; term weighting; class frequencies and distributions; text mining; similarity measure; k-means
Online Access:	https://www.mdpi.com/2073-8994/12/6/967

description	In similarity-based constrained clustering, various approaches have been proposed for defining the similarity between documents, which guides the grouping of similar documents. This paper presents an approach that uses term-distribution statistics, extracted from a small number of cue instances with known classes, as term weightings that act as an indirect distance constraint. For distribution-based term weighting, three types of term-oriented standard deviations are exploited: the distribution of a term in a collection (SD), the average distribution of a term in a class (ACSD), and the average distribution of a term among classes (CSD). These term weightings are explored with symmetry in mind, by varying the magnitude over positive and negative values to obtain promoting and demoting effects of the three standard deviations. In k-means, following the symmetry concept, both seeded and unseeded centroid initializations are investigated and compared to centroid-based classification. The experiments use five English text collections (Amazon, DI, WebKB1, WebKB2, and 20Newsgroup) and one Thai text collection (TR, a collection of Thai reform-related opinions). Compared to conventional TFIDF, the distribution-based term weighting improves the centroid-based method, seeded k-means, and k-means with error reduction rates of 22.45%, 31.13%, and 58.96%, respectively.
spelling	doaj.art-5631b2d183784499b84481d7549a7985; 2023-11-20T03:02:16Z; eng; MDPI AG; Symmetry; ISSN 2073-8994; 2020-06-01; vol. 12, no. 6, art. 967; DOI 10.3390/sym12060967; Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints; Uraiwan Buatoom, Waree Kongprawechnon, Thanaruk Theeramunkong (all three authors: School of Information, Computer and Communication Technology (ICT), Sirindhorn International Institute of Technology, Thammasat University, 131 Moo 5, Tiwanon Road, Bangkadi, Pathumthani 12000, Thailand); https://www.mdpi.com/2073-8994/12/6/967
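
The record's description mentions distribution-based term weighting (SD, ACSD, CSD), seeded k-means, and an error reduction rate, but gives no formulas. The following is a minimal Python sketch of one possible reading, not the authors' implementation. Everything specific in it is an assumption made for illustration: the exact definitions of the three standard deviations, the multiplicative form with exponents a, b, c, the smoothing constant eps, the smoothed IDF, the toy data, and the function names distribution_weights, kmeans, and error_reduction_rate.

```python
import numpy as np

def distribution_weights(tf_cue, labels, a=0.0, b=-1.0, c=1.0, eps=1e-3):
    """Per-term weight (SD+eps)**a * (ACSD+eps)**b * (CSD+eps)**c.

    Assumed readings (not taken from the record):
      SD   = std of a term's frequency over all cue documents
      ACSD = mean over classes of the term's within-class std
      CSD  = std of the term's per-class mean frequencies
    A positive exponent promotes a statistic, a negative one demotes it.
    """
    classes = np.unique(labels)
    sd = tf_cue.std(axis=0)
    acsd = np.mean([tf_cue[labels == k].std(axis=0) for k in classes], axis=0)
    csd = np.std([tf_cue[labels == k].mean(axis=0) for k in classes], axis=0)
    return (sd + eps) ** a * (acsd + eps) ** b * (csd + eps) ** c

def kmeans(X, centroids, n_iter=20):
    """Plain Lloyd iterations starting from the given (seeded) centroids."""
    centroids = centroids.astype(float).copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every document to every centroid.
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for k in range(len(centroids)):
            if np.any(assign == k):  # leave an empty cluster's centroid as is
                centroids[k] = X[assign == k].mean(axis=0)
    return assign, centroids

def error_reduction_rate(err_baseline, err_new):
    """Relative reduction of the error with respect to a baseline error."""
    return (err_baseline - err_new) / err_baseline

# Toy term-frequency matrix: 6 documents x 3 terms (hypothetical data).
tf = np.array([[4, 0, 2], [3, 1, 1], [5, 0, 2],
               [0, 4, 1], [1, 3, 2], [0, 5, 1]], dtype=float)
cue_idx = np.array([0, 1, 3, 4])      # documents whose class is known
cue_labels = np.array([0, 0, 1, 1])   # their classes

w = distribution_weights(tf[cue_idx], cue_labels)
idf = np.log(1.0 + len(tf) / (tf > 0).sum(axis=0))  # smoothed IDF (assumption)
X = tf * idf * w                                    # weighted document vectors

# Seeded initialization: one centroid per class, averaged over its cue documents.
seeds = np.array([X[cue_idx[cue_labels == k]].mean(axis=0)
                  for k in np.unique(cue_labels)])
assign, _ = kmeans(X, seeds)
print("term weights:", w)
print("cluster assignment:", assign)
```

In this toy setup the third term has the same mean frequency in both classes, so its CSD is zero and its weight is close to zero, while the two class-discriminating terms are promoted. An unseeded run would replace seeds with randomly chosen rows of X.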


Similar Items