New rough set based maximum partitioning attribute algorithm for categorical data clustering

Clustering a set of data into homogeneous groups is a fundamental operation in data mining. Recently, attention has turned to categorical data clustering, where the data set consists of non-numerical attributes. However, implementing several existing categorical clustering algorithms is challe...


Bibliographic Details
Main Author: Jomah Baroud, Muftah Mohamed
Format: Thesis
Language: English
Published: 2022
Subjects:
Online Access: http://eprints.utm.my/101497/1/MuftahMohamedJomahBaroudPSC2022.pdf.pdf
_version_ 1796867078493306880
author Jomah Baroud, Muftah Mohamed
author_facet Jomah Baroud, Muftah Mohamed
author_sort Jomah Baroud, Muftah Mohamed
collection ePrints
description Clustering a set of data into homogeneous groups is a fundamental operation in data mining. Recently, attention has turned to categorical data clustering, where the data set consists of non-numerical attributes. However, several existing categorical clustering algorithms are challenging to implement, as some cannot handle uncertainty while others have stability issues. Rough Set Theory (RST) is a mathematical tool for dealing with categorical data and handling uncertainty. It is also used to identify cause-effect relationships in databases as a form of learning and data mining. Therefore, this study aims to address the issues of uncertainty and stability in categorical clustering, and it proposes an improved algorithm centred on RST. The proposed method employs a partitioning measure to calculate the positive and boundary regions of attributes in the information system. First, an attribute partitioning method called Positive Region-based Indiscernibility (PRI) was developed to address the uncertainty issue in attribute partitioning for categorical data. The PRI method requires a partitioning calculation based on the positive and boundary regions. Next, to address the computational complexity of the clustering process, a clustering attribute selection method called Maximum Mean Partitioning (MMP) is introduced. MMP computes the mean partitioning value of each attribute, and the attribute with the maximum mean partitioning value is chosen as the best clustering attribute. Integrating the proposed PRI and MMP methods yields a new rough set hybrid algorithm for categorical data clustering, named the Maximum Partitioning Attribute (MPA) algorithm. This hybrid algorithm is an all-inclusive solution addressing uncertainty, computational complexity, cluster purity, and accuracy in attribute partitioning and clustering attribute selection.
The proposed MPA algorithm is compared against the baseline algorithms, namely Maximum Significance Attribute (MSA), Information-Theoretic Dependency Roughness (ITDR), Maximum Indiscernibility Attribute (MIA), and classical K-means. Seven small data sets from previously published research cases and 21 UCI repository and benchmark data sets are used for validation. The results, presented in tabular and graphical form, show that the proposed MPA algorithm outperforms the baseline algorithms on all data sets. The MPA algorithm improves rough accuracy over MSA, ITDR, and MIA by 54.42%. It also reduces computational cost compared to MSA, ITDR, and MIA, with 77.11% less time and 58.66% fewer iterations. Similarly, an improvement of up to 97.35% in overall purity was observed for the MPA algorithm against MSA, ITDR, and MIA. In addition, MPA increases overall accuracy over simple K-means by up to 34.41%. Hence, the results indicate that the proposed MPA algorithm offers a promising solution to the categorical data clustering problem.
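The record does not give the PRI or MMP formulas, so the following is only a minimal sketch of the standard rough set notions the abstract relies on: equivalence classes of an attribute, and the positive and boundary regions of a set under a partition. The scoring function `mean_positive_share` and the selector `pick_clustering_attribute` are hypothetical stand-ins for "choose the attribute with the maximum mean partitioning value"; they are not the thesis's definitions.

```python
from collections import defaultdict


def partition(rows, attr):
    """Group object indices into equivalence classes by the value of `attr`."""
    classes = defaultdict(set)
    for idx, row in enumerate(rows):
        classes[row[attr]].add(idx)
    return list(classes.values())


def regions(target, blocks):
    """Return (positive, boundary) regions of `target` under partition `blocks`.

    Positive region = lower approximation (blocks fully inside target).
    Boundary region = upper approximation minus lower approximation.
    """
    lower, upper = set(), set()
    for block in blocks:
        if block <= target:
            lower |= block
        if block & target:
            upper |= block
    return lower, upper - lower


def mean_positive_share(rows, attr, attrs):
    """Hypothetical score (an assumption, not the thesis's MMP formula):
    the average, over every other attribute, of the fraction of objects
    lying in the positive region of attr's classes."""
    shares = []
    for other in attrs:
        if other == attr:
            continue
        blocks = partition(rows, other)
        positive = set()
        for target in partition(rows, attr):
            lower, _ = regions(target, blocks)
            positive |= lower
        shares.append(len(positive) / len(rows))
    return sum(shares) / len(shares)


def pick_clustering_attribute(rows, attrs):
    """Select the attribute with the largest mean score (first one wins ties)."""
    return max(attrs, key=lambda a: mean_positive_share(rows, a, attrs))


if __name__ == "__main__":
    # Toy categorical table, invented purely for illustration.
    rows = [
        {"color": "red", "size": "s", "shape": "round"},
        {"color": "red", "size": "s", "shape": "round"},
        {"color": "blue", "size": "l", "shape": "round"},
        {"color": "blue", "size": "l", "shape": "square"},
    ]
    attrs = ["color", "size", "shape"]
    for a in attrs:
        print(a, mean_positive_share(rows, a, attrs))
    print("selected:", pick_clustering_attribute(rows, attrs))
```

Averaging over the other attributes is one plausible reading of "mean partitioning"; a faithful implementation would need the definitions in the thesis itself.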
first_indexed 2024-03-05T21:21:48Z
format Thesis
id utm.eprints-101497
institution Universiti Teknologi Malaysia - ePrints
language English
last_indexed 2024-03-05T21:21:48Z
publishDate 2022
record_format dspace
spelling utm.eprints-101497 2023-06-21T10:21:57Z http://eprints.utm.my/101497/ New rough set based maximum partitioning attribute algorithm for categorical data clustering Jomah Baroud, Muftah Mohamed QA75 Electronic computers. Computer science 2022 Thesis NonPeerReviewed application/pdf en http://eprints.utm.my/101497/1/MuftahMohamedJomahBaroudPSC2022.pdf.pdf Jomah Baroud, Muftah Mohamed (2022) New rough set based maximum partitioning attribute algorithm for categorical data clustering. PhD thesis, Universiti Teknologi Malaysia. http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:150786
spellingShingle QA75 Electronic computers. Computer science
Jomah Baroud, Muftah Mohamed
New rough set based maximum partitioning attribute algorithm for categorical data clustering
title New rough set based maximum partitioning attribute algorithm for categorical data clustering
topic QA75 Electronic computers. Computer science
url http://eprints.utm.my/101497/1/MuftahMohamedJomahBaroudPSC2022.pdf.pdf
work_keys_str_mv AT jomahbaroudmuftahmohamed newroughsetbasedmaximumpartitioningattributealgorithmforcategoricaldataclustering