Improved pattern extraction scheme for clustering multidimensional data

Multidimensional data refers to data that contains at least three attributes or dimensions. The huge amount of multidimensional data collected over the years has greatly challenged the ability to digest the data and to gain useful knowledge that would otherwise be lost....

Bibliographic Details
Main Author: Musdholifah, Aina
Format: Thesis
Language: English
Published: 2013
Subjects: QA Mathematics
Online Access: http://eprints.utm.my/37952/1/AinaMusdholifahPFSKSM2013.pdf
author Musdholifah, Aina
collection ePrints
description Multidimensional data refers to data that contains at least three attributes or dimensions. The huge amount of multidimensional data collected over the years has greatly challenged the ability to digest the data and to gain useful knowledge that would otherwise be lost. Clustering techniques make it possible to analyze this data for interesting patterns that could benefit the relevant parties. In this study, three crucial challenges in extracting patterns from multidimensional data are highlighted: the dimensionality of huge multidimensional data requires an efficient exploration method for pattern extraction, better mechanisms are needed to test and validate clustering results, and more informative visualization is needed to interpret the “best” clusters. Density-based clustering algorithms such as density-based spatial clustering of applications with noise (DBSCAN), density clustering (DENCLUE) and kernel fuzzy C-means (KFCM), which use a probabilistic similarity function, were introduced in previous work to determine the number of clusters automatically. However, they have difficulty dealing with clusters of different densities, shapes and sizes. In addition, they require many input parameters that are difficult to determine. Kernel-nearest-neighbor (KNN)-density-based clustering, including kernel-nearest-neighbor-based clustering (KNNClust), has been proposed to solve the problem of determining smoothing parameters for multidimensional data and to discover clusters of arbitrary shape and density. However, KNNClust has problems clustering data of different sizes. Therefore, this research proposed a new pattern extraction scheme, called TKC, that integrates a triangular kernel function and a local average density technique to improve the KNN-density-based clustering algorithm.
The improved scheme has been validated experimentally in two scenarios: using real multidimensional spatio-temporal data and using various classification datasets. Four different measurements were used to validate the clustering results: the Dunn and Silhouette indices to assess cluster quality, the F-measure to evaluate the accuracy of the approach, an ANOVA test to analyze the cluster distribution, and processing time to measure efficiency. The proposed scheme was benchmarked against other well-known clustering methods including KNNClust, Iterative Local Gaussian Clustering (ILGC), basic k-means, KFCM, DBSCAN and DENCLUE. The results on the classification datasets demonstrated that TKC produced clusters with higher accuracy and was more efficient than the other clustering methods. In addition, the analysis of the results showed that the proposed TKC scheme is capable of handling multidimensional data, as validated by Silhouette and Dunn index values that were close to one, indicating reliable results.
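As an illustration of the kind of estimate the description refers to, the following is a minimal Python sketch of a nearest-neighbor density score computed with a triangular kernel. It is not the TKC scheme from the thesis, whose details are not given in this record. The function names, the use of the k-th nearest-neighbor distance as the bandwidth, the omission of the kernel's normalising constant, and the sample points are all assumptions made for illustration.

```python
import math


def triangular_kernel(u):
    """Triangular kernel: K(u) = 1 - |u| for |u| <= 1, and 0 otherwise."""
    return max(0.0, 1.0 - abs(u))


def knn_triangular_density(points, k):
    """Return an (unnormalised) density score for every point.

    Assumption for this sketch: the bandwidth h of a point is the distance
    to its k-th nearest neighbour, and the score is
        (1 / (n * h**dim)) * sum_j K(dist(p, x_j) / h)
    taken over all points x_j, including p itself.
    """
    n = len(points)
    if not 1 <= k <= n - 1:
        raise ValueError("k must be between 1 and len(points) - 1")
    dim = len(points[0])
    densities = []
    for i, p in enumerate(points):
        dists = [math.dist(p, q) for q in points]
        # Distances to the other points, nearest first.
        neighbour_dists = sorted(d for j, d in enumerate(dists) if j != i)
        h = neighbour_dists[k - 1]
        if h == 0.0:
            raise ValueError("zero bandwidth: duplicate points need separate handling")
        # The kernel has compact support, so points farther than h add nothing.
        weight_sum = sum(triangular_kernel(d / h) for d in dists)
        densities.append(weight_sum / (n * h ** dim))
    return densities


if __name__ == "__main__":
    # Hypothetical three-attribute sample: four close points and one isolated point.
    sample = [
        (0.0, 0.0, 0.0),
        (0.1, 0.0, 0.0),
        (0.0, 0.1, 0.0),
        (0.0, 0.0, 0.1),
        (5.0, 5.0, 5.0),
    ]
    for point, score in zip(sample, knn_triangular_density(sample, k=3)):
        print(point, score)
```

On the sample points, the four closely grouped points receive a much higher score than the isolated one. A density-based clustering method could use such scores to separate dense regions from sparse ones, although how TKC does this is not described here.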
first_indexed 2024-03-05T19:01:36Z
format Thesis
id utm.eprints-37952
institution Universiti Teknologi Malaysia - ePrints
language English
last_indexed 2024-03-05T19:01:36Z
publishDate 2013
record_format dspace
spelling utm.eprints-37952 2018-04-12T05:38:45Z http://eprints.utm.my/37952/ Improved pattern extraction scheme for clustering multidimensional data Musdholifah, Aina QA Mathematics 2013-07 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/37952/1/AinaMusdholifahPFSKSM2013.pdf Musdholifah, Aina (2013) Improved pattern extraction scheme for clustering multidimensional data. PhD thesis, Universiti Teknologi Malaysia, Faculty of Computing.
title Improved pattern extraction scheme for clustering multidimensional data
topic QA Mathematics
url http://eprints.utm.my/37952/1/AinaMusdholifahPFSKSM2013.pdf