Clustering Heterogeneous Data with k-Means by Mutual Information-Based Unsupervised Feature Transformation
Traditional centroid-based clustering algorithms for heterogeneous data with numerical and non-numerical features result in different levels of inaccurate clustering. This is because the Hamming distance used for dissimilarity measurement of non-numerical values does not provide optimal distances be...
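The limitation the abstract attributes to Hamming distance can be illustrated with a small sketch. It is not the paper's UFT method, which this record does not specify. The additive `gamma`-weighted combination of Euclidean and Hamming distance, and the names `hamming`, `euclidean`, `mixed_distance`, and `nearest`, are assumptions chosen for illustration; the toy records are made up.

```python
import math

def hamming(a, b):
    """Count the positions at which two categorical tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance between two numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mixed_distance(rec_a, rec_b, gamma=1.0):
    """Assumed combination: Euclidean on numeric part + gamma * Hamming on categorical part."""
    (num_a, cat_a), (num_b, cat_b) = rec_a, rec_b
    return euclidean(num_a, num_b) + gamma * hamming(cat_a, cat_b)

def nearest(query, candidates, gamma=1.0):
    """Return the name of the candidate record closest to the query."""
    return min(candidates, key=lambda name: mixed_distance(query, candidates[name], gamma))

# 1) Hamming distance cannot rank categories: every other colour is equally far from "red".
same_for_all = {hamming(("red",), (other,)) for other in ("orange", "blue", "green")}

# 2) Mixing the two distances depends on the numeric scale.
#    The same toy records are expressed in metres and in centimetres.
query_m = ((1.70,), ("red",))
candidates_m = {"b": ((1.80,), ("blue",)), "c": ((1.95,), ("red",))}
query_cm = ((170.0,), ("red",))
candidates_cm = {"b": ((180.0,), ("blue",)), "c": ((195.0,), ("red",))}

nearest_m = nearest(query_m, candidates_m)     # categorical mismatch dominates
nearest_cm = nearest(query_cm, candidates_cm)  # numeric gap dominates

print(same_for_all, nearest_m, nearest_cm)
```

With these toy values the nearest neighbour of the query changes when the numeric feature is merely rescaled, which is one way the difficulty of combining the two distances can show up in practice.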
Main Authors: | Min Wei, Tommy W. S. Chow, Rosa H. M. Chan |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2015-03-01 |
Series: | Entropy |
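Subjects: | feature transformation; k-means; clustering heterogeneous data; numerical features; non-numerical features |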
Online Access: | http://www.mdpi.com/1099-4300/17/3/1535 |
_version_ | 1818006946750398464 |
---|---|
author | Min Wei; Tommy W. S. Chow; Rosa H. M. Chan |
author_sort | Min Wei |
collection | DOAJ |
description | Traditional centroid-based clustering algorithms cluster heterogeneous data, which has both numerical and non-numerical features, with varying degrees of inaccuracy. This is because the Hamming distance used to measure dissimilarity between non-numerical values does not provide optimal distances between different values, and problems arise from attempts to combine the Euclidean distance and the Hamming distance. In this study, mutual information (MI)-based unsupervised feature transformation (UFT), which can transform non-numerical features into numerical features without information loss, was used with the conventional k-means algorithm to cluster heterogeneous data. For the original non-numerical features, UFT provides numerical values that preserve the structure of those features while also behaving as continuous values. Experiments and analysis on real-world datasets showed that the integrated UFT-k-means clustering algorithm outperformed others for heterogeneous data with both numerical and non-numerical features. |
first_indexed | 2024-04-14T05:08:22Z |
format | Article |
id | doaj.art-5466143994154ec7bfda375da9b9f934 |
institution | Directory Open Access Journal |
issn | 1099-4300 |
language | English |
last_indexed | 2024-04-14T05:08:22Z |
publishDate | 2015-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Entropy |
spelling | doaj.art-5466143994154ec7bfda375da9b9f934; 2022-12-22T02:10:37Z; eng; MDPI AG; Entropy; 1099-4300; 2015-03-01; 17; 3; 1535; 1548; 10.3390/e17031535; e17031535; Clustering Heterogeneous Data with k-Means by Mutual Information-Based Unsupervised Feature Transformation; Min Wei (0), Tommy W. S. Chow (1), Rosa H. M. Chan (2); (0, 1, 2) Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong; http://www.mdpi.com/1099-4300/17/3/1535; feature transformation; k-means; clustering heterogeneous data; numerical features; non-numerical features |
title | Clustering Heterogeneous Data with k-Means by Mutual Information-Based Unsupervised Feature Transformation |
title_sort | clustering heterogeneous data with k means by mutual information based unsupervised feature transformation |
topic | feature transformation; k-means; clustering heterogeneous data; numerical features; non-numerical features |
url | http://www.mdpi.com/1099-4300/17/3/1535 |