Clustering Heterogeneous Data with k-Means by Mutual Information-Based Unsupervised Feature Transformation

Traditional centroid-based clustering algorithms for heterogeneous data with numerical and non-numerical features result in different levels of inaccurate clustering. This is because the Hamming distance used for dissimilarity measurement of non-numerical values does not provide optimal distances be...

Bibliographic Details
Main Authors: Min Wei, Tommy W. S. Chow, Rosa H. M. Chan
Format: Article
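Language: English
Published: MDPI AG 2015-03-01
Series: Entropy
Subjects: feature transformation; k-means; clustering heterogeneous data; numerical features; non-numerical features
Online Access: http://www.mdpi.com/1099-4300/17/3/1535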
author Min Wei
Tommy W. S. Chow
Rosa H. M. Chan
collection DOAJ
description Traditional centroid-based clustering algorithms for heterogeneous data with numerical and non-numerical features result in different levels of inaccurate clustering. This is because the Hamming distance used for dissimilarity measurement of non-numerical values does not provide optimal distances between different values, and problems arise from attempts to combine the Euclidean distance and Hamming distance. In this study, the mutual information (MI)-based unsupervised feature transformation (UFT), which can transform non-numerical features into numerical features without information loss, was utilized with the conventional k-means algorithm for heterogeneous data clustering. For the original non-numerical features, UFT can provide numerical values that preserve the structure of the original non-numerical features and have the property of continuous values at the same time. Experiments and analysis of real-world datasets showed that the integrated UFT-k-means clustering algorithm outperformed others for heterogeneous data with both numerical and non-numerical features.
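The two problems the description attributes to Hamming distance can be illustrated with a small sketch. This is not the authors' UFT method, which the record does not specify; the function names, the `weight` parameter, and the sample records are all hypothetical and chosen only for illustration.

```python
import math


def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mixed_distance(num_a, cat_a, num_b, cat_b, weight=1.0):
    """Naive mixed distance: Euclidean on the numerical part plus a
    weighted Hamming distance on the non-numerical part.
    `weight` is a hypothetical tuning knob, not taken from the paper."""
    return euclidean(num_a, num_b) + weight * hamming(cat_a, cat_b)


# Hypothetical records: (numerical features, non-numerical features).
small = ([1.0], ["small"])
medium = ([1.0], ["medium"])
large = ([1.0], ["large"])

# Problem 1: Hamming distance cannot tell that "small" is closer to
# "medium" than to "large"; every pair of distinct values is 1 apart.
d_small_medium = mixed_distance(*small, *medium)
d_small_large = mixed_distance(*small, *large)
print(d_small_medium, d_small_large)

# Problem 2: the combined distance depends on the numerical units.
# The same pair of records, with the numerical feature expressed in
# units and then in thousands of units.
d_units = mixed_distance([30000.0], ["red"], [31000.0], ["blue"])
d_thousands = mixed_distance([30.0], ["red"], [31.0], ["blue"])
print(d_units, d_thousands)
```

In the first comparison both distances are equal, so a centroid-based algorithm has no way to exploit any structure among the non-numerical values. In the second, the non-numerical mismatch is negligible next to the numerical term in one scaling and contributes half of the distance in the other, which is the kind of difficulty the description refers to when combining the two measures.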
first_indexed 2024-04-14T05:08:22Z
format Article
id doaj.art-5466143994154ec7bfda375da9b9f934
institution Directory Open Access Journal
issn 1099-4300
language English
last_indexed 2024-04-14T05:08:22Z
publishDate 2015-03-01
publisher MDPI AG
record_format Article
series Entropy
spelling doaj.art-5466143994154ec7bfda375da9b9f934
2022-12-22T02:10:37Z
eng
MDPI AG
Entropy
1099-4300
2015-03-01
17
3
1535
1548
10.3390/e17031535
e17031535
Clustering Heterogeneous Data with k-Means by Mutual Information-Based Unsupervised Feature Transformation
Min Wei 0
Tommy W. S. Chow 1
Rosa H. M. Chan 2
Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Traditional centroid-based clustering algorithms for heterogeneous data with numerical and non-numerical features result in different levels of inaccurate clustering. This is because the Hamming distance used for dissimilarity measurement of non-numerical values does not provide optimal distances between different values, and problems arise from attempts to combine the Euclidean distance and Hamming distance. In this study, the mutual information (MI)-based unsupervised feature transformation (UFT), which can transform non-numerical features into numerical features without information loss, was utilized with the conventional k-means algorithm for heterogeneous data clustering. For the original non-numerical features, UFT can provide numerical values which preserve the structure of the original non-numerical features and have the property of continuous values at the same time. Experiments and analysis of real-world datasets showed that, the integrated UFT-k-means clustering algorithm outperformed others for heterogeneous data with both numerical and non-numerical features.
http://www.mdpi.com/1099-4300/17/3/1535
feature transformation
k-means
clustering heterogeneous data
numerical features
non-numerical features
title Clustering Heterogeneous Data with k-Means by Mutual Information-Based Unsupervised Feature Transformation
topic feature transformation
k-means
clustering heterogeneous data
numerical features
non-numerical features
url http://www.mdpi.com/1099-4300/17/3/1535