Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset
Data mining techniques have been used to analyse patterns in data sets in order to derive useful information. Classification of data sets into clusters is one of the essential processes for data manipulation. One of the most popular and efficient clustering methods is the K-means method. However, the K-m...
Main Author: | Armina, Roslan |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2018 |
Subjects: | QA76 Computer software |
Online Access: | http://eprints.utm.my/81435/1/RoslanArminaMFC2018.pdf |
_version_ | 1796863386864058368 |
---|---|
author | Armina, Roslan |
author_facet | Armina, Roslan |
author_sort | Armina, Roslan |
collection | ePrints |
description | Data mining techniques have been used to analyse patterns in data sets in order to derive useful information. Classification of data sets into clusters is one of the essential processes for data manipulation. One of the most popular and efficient clustering methods is the K-means method. However, the K-means clustering method has difficulty analysing high-dimensional data sets that contain missing values. Moreover, previous studies showed that the high dimensionality of the features in a data set poses different problems for K-means clustering. For the missing value problem, an imputation method is needed to minimise the effect of incomplete high-dimensional data sets on the K-means clustering process. This research studies the effect of imputation algorithms and dimensionality reduction techniques on the performance of K-means clustering. Three imputation methods are implemented for missing value estimation: K-nearest neighbours (KNN), Local Least Squares (LLS), and Bayesian Principal Component Analysis (BPCA). Principal Component Analysis (PCA) is a dimensionality reduction method that removes unnecessary attributes from high-dimensional data sets. Hence, a hybrid of PCA and K-means (PCA K-means) is proposed to give a better clustering result. The experiments were performed using the Wisconsin Breast Cancer data set. With the LLS imputation method, the proposed hybrid PCA K-means outperformed standard K-means clustering on the breast cancer data set in terms of clustering accuracy (0.29%) and computing time (95.76%). |
first_indexed | 2024-03-05T20:25:59Z |
format | Thesis |
id | utm.eprints-81435 |
institution | Universiti Teknologi Malaysia - ePrints |
language | English |
last_indexed | 2024-03-05T20:25:59Z |
publishDate | 2018 |
record_format | dspace |
spelling | utm.eprints-81435 2019-08-23T05:01:06Z http://eprints.utm.my/81435/ Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset Armina, Roslan QA76 Computer software 2018 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/81435/1/RoslanArminaMFC2018.pdf Armina, Roslan (2018) Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset. Masters thesis, Universiti Teknologi Malaysia. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:119420 |
spellingShingle | QA76 Computer software Armina, Roslan Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset |
title | Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset |
title_full | Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset |
title_fullStr | Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset |
title_full_unstemmed | Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset |
title_short | Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset |
title_sort | improved k means clustering using principal component analysis and imputation methods for breast cancer dataset |
topic | QA76 Computer software |
url | http://eprints.utm.my/81435/1/RoslanArminaMFC2018.pdf |
work_keys_str_mv | AT arminaroslan improvedkmeansclusteringusingprincipalcomponentanalysisandimputationmethodsforbreastcancerdataset |
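
The description field above outlines a three-stage pipeline: impute missing values, reduce dimensionality with PCA, then cluster with K-means. The following is a minimal, standard-library-only sketch of that idea and is not the thesis implementation. It assumes KNN imputation only (LLS and BPCA are not shown), approximates PCA with power iteration, and uses a deterministic farthest-point initialisation for K-means so the run is reproducible. The toy data and all function names (`knn_impute`, `pca_project`, `kmeans`) are hypothetical and are not drawn from the Wisconsin Breast Cancer data set.

```python
import math
import random


def knn_impute(rows, k=2):
    """Fill None entries with the column mean over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        known = [j for j, v in enumerate(r) if v is not None]
        # Distance is measured only on the columns observed in the incomplete row.
        nearest = sorted(
            complete, key=lambda c: sum((c[j] - r[j]) ** 2 for j in known)
        )[:k]
        filled.append([
            v if v is not None else sum(c[j] for c in nearest) / len(nearest)
            for j, v in enumerate(r)
        ])
    return filled


def pca_project(rows, n_components=2, iters=200, seed=0):
    """Project centred rows onto leading covariance eigenvectors (power iteration)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(x[a] * x[b] for x in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient estimates the eigenvalue; deflate before the next pass.
        eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
        components.append(v)
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)] for a in range(d)]
    return [[sum(x[j] * c[j] for j in range(d)) for c in components] for x in centred]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k=2, iters=50):
    """Lloyd's algorithm with farthest-point initialisation (an assumption here)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        centroids.append(list(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


# Hypothetical toy data: two well-separated groups, two missing entries (None).
data = [
    [1.0, 2.0, 1.5],
    [1.2, None, 1.4],
    [0.9, 2.1, 1.6],
    [8.0, 9.0, 8.5],
    [8.2, 9.1, None],
    [7.9, 8.8, 8.4],
]
imputed = knn_impute(data, k=2)
reduced = pca_project(imputed, n_components=2)
labels = kmeans(reduced, k=2)
print(labels)
```

Clustering in the reduced space means each distance computation touches fewer coordinates, which is one plausible source of the computing-time difference the description reports; the sketch does not attempt to reproduce the reported figures.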