Improved pattern extraction scheme for clustering multidimensional data
Multidimensional data refers to data that contains at least three attributes or dimensions. The huge amount of multidimensional data collected over the years has greatly challenged the ability to digest the data and to gain useful knowledge that would otherwise be lost....
Main Author: | Musdholifah, Aina |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2013 |
Subjects: | QA Mathematics |
Online Access: | http://eprints.utm.my/37952/1/AinaMusdholifahPFSKSM2013.pdf |
_version_ | 1796857692748251136 |
---|---|
author | Musdholifah, Aina |
author_facet | Musdholifah, Aina |
author_sort | Musdholifah, Aina |
collection | ePrints |
description | Multidimensional data refers to data that contains at least three attributes or dimensions. The huge amount of multidimensional data collected over the years has greatly challenged the ability to digest the data and to gain useful knowledge that would otherwise be lost. Clustering techniques have enabled the manipulation of this knowledge to gain an interesting pattern analysis that could benefit the relevant parties. In this study, three crucial challenges in extracting patterns from multidimensional data are highlighted: the dimensionality of huge multidimensional data requires an efficient exploration method for pattern extraction; better mechanisms are needed to test and validate clustering results; and more informative visualization is needed to interpret the “best” clusters. Density-based clustering algorithms such as density-based spatial clustering of applications with noise (DBSCAN), density clustering (DENCLUE) and kernel fuzzy C-means (KFCM), which use a probabilistic similarity function, have been introduced in previous work to determine the number of clusters automatically. However, they have difficulty dealing with clusters of different densities, shapes and sizes. In addition, they require many input parameters that are difficult to determine. Kernel-nearest-neighbor (KNN)-density-based clustering, including kernel-nearest-neighbor-based clustering (KNNClust), has been proposed to solve the problem of determining smoothing parameters for multidimensional data and to discover clusters with arbitrary shapes and densities. However, KNNClust has problems clustering data of different sizes. Therefore, this research proposed a new pattern extraction scheme, called TKC, that integrates a triangular kernel function and a local average density technique to improve the KNN-density-based clustering algorithm.
The improved scheme has been validated experimentally in two scenarios: using real multidimensional spatio-temporal data and using various classification datasets. Four types of measurement were used to validate the clustering results: the Dunn and Silhouette indices to assess quality, the F-measure to evaluate accuracy, an ANOVA test to analyze the cluster distribution, and processing time to measure efficiency. The proposed scheme was benchmarked against other well-known clustering methods, including KNNClust, Iterative Local Gaussian Clustering (ILGC), basic k-means, KFCM, DBSCAN and DENCLUE. The results on the classification datasets demonstrated that TKC produced clusters with higher accuracy and was more efficient than the other clustering methods. In addition, the analysis of the results showed that the proposed TKC scheme is capable of handling multidimensional data, as validated by Silhouette and Dunn index values close to one, indicating reliable results. |
first_indexed | 2024-03-05T19:01:36Z |
format | Thesis |
id | utm.eprints-37952 |
institution | Universiti Teknologi Malaysia - ePrints |
language | English |
last_indexed | 2024-03-05T19:01:36Z |
publishDate | 2013 |
record_format | dspace |
spelling | utm.eprints-37952 2018-04-12T05:38:45Z http://eprints.utm.my/37952/ Improved pattern extraction scheme for clustering multidimensional data Musdholifah, Aina QA Mathematics 2013-07 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/37952/1/AinaMusdholifahPFSKSM2013.pdf Musdholifah, Aina (2013) Improved pattern extraction scheme for clustering multidimensional data. PhD thesis, Universiti Teknologi Malaysia, Faculty of Computing. |
spellingShingle | QA Mathematics Musdholifah, Aina Improved pattern extraction scheme for clustering multidimensional data |
title | Improved pattern extraction scheme for clustering multidimensional data |
title_full | Improved pattern extraction scheme for clustering multidimensional data |
title_fullStr | Improved pattern extraction scheme for clustering multidimensional data |
title_full_unstemmed | Improved pattern extraction scheme for clustering multidimensional data |
title_short | Improved pattern extraction scheme for clustering multidimensional data |
title_sort | improved pattern extraction scheme for clustering multidimensional data |
topic | QA Mathematics |
url | http://eprints.utm.my/37952/1/AinaMusdholifahPFSKSM2013.pdf |
work_keys_str_mv | AT musdholifahaina improvedpatternextractionschemeforclusteringmultidimensionaldata |
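
The description says the proposed scheme combines a triangular kernel function with a local, nearest-neighbor-based density, but the record gives no formula. The following is only a minimal sketch of that general idea, not the thesis's TKC algorithm: the function names, the choice of the k-th neighbor distance as bandwidth, and the omitted kernel normalising constant are all assumptions made for illustration.

```python
import math


def triangular_kernel(u):
    """Triangular kernel: K(u) = 1 - |u| for |u| <= 1, and 0 otherwise."""
    return max(0.0, 1.0 - abs(u))


def knn_triangular_density(points, k):
    """Return one unnormalised density score per point.

    Assumption (not from the thesis): the bandwidth h for each point is the
    distance to its k-th nearest neighbour, and the score is the sum of
    triangular-kernel weights of its k nearest neighbours divided by n * h^d.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")
    dim = len(points[0])
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        h = dists[k - 1]  # adaptive bandwidth: k-th neighbour distance
        if h == 0.0:
            # k or more exact duplicates of p: treat as infinitely dense.
            scores.append(math.inf)
            continue
        weight_sum = sum(triangular_kernel(d / h) for d in dists[:k])
        scores.append(weight_sum / (n * h ** dim))
    return scores


# Toy data: four points in a tight square plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_triangular_density(points, k=3)
```

On this toy input the four corner points receive a higher score than the isolated point, which is the contrast a density-based clustering method relies on. Because the bandwidth adapts to each point's neighborhood, no global smoothing parameter has to be chosen, only `k`.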