Information theoretic feature selection clustering

Bibliographic Details
Main Author: Quan, Yu Teng.
Other Authors: School of Computer Engineering
Format: Final Year Project (FYP)
Language: English
Published: 2012
Subjects:
Online Access: http://hdl.handle.net/10356/48493
Description
Summary: Clustering is a part of data mining, the process of analyzing data from various angles to discover new patterns in large data sets and to find correlations that transform the information into reliable and tangible data. However, data mining is usually concerned with large, high-dimensional data, and most current algorithms are sensitive to scale, to high dimensionality, or to both. The type of features plays an important role: some features are crucial for clustering, while others merely obstruct the process. One way to overcome this problem is to select a subset of key features. To further improve clustering accuracy, a non-parametric estimate of the average class entropies can be used to search for a clustering algorithm that maximizes the estimated mutual information between clusters and data points. Several methods have been implemented, such as the k-Means algorithm, which relies on the properties of distance measures and reduces computing cost but lacks accuracy. To counter this lack of accuracy while maintaining efficiency, the Nonparametric Information Clustering (NIC) algorithm is used to divide a set of objects into groups, with data points processed and moved according to a specific distance from each point to a cluster center. Because it is nonparametric, it is applicable when input parameters are unknown. NIC was tested against k-Means on different data sets, and the results show that NIC achieves better accuracy. To examine accuracy further, an error rate function will be implemented to check the correctness of each cluster.
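
To make the k-Means baseline and the error rate check mentioned in the summary concrete, the following is a minimal Python sketch. It is not the project's implementation: the function names `kmeans` and `cluster_error_rate`, the majority-vote definition of the error rate, and the toy data are all assumptions made for illustration, and the NIC algorithm itself is not shown.

```python
import random
from collections import Counter


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd-style k-means on a list of equal-length tuples (iters >= 1)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k sampled data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest center (squared Euclidean).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center moves to the mean of its current members.
        for c in range(k):
            members = [p for p, l in zip(points, new_labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels


def cluster_error_rate(true_labels, cluster_labels):
    """Fraction of points that differ from the majority class of their cluster.

    Assumed definition: each cluster is mapped to its most common true class.
    """
    by_cluster = {}
    for t, c in zip(true_labels, cluster_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return 1.0 - correct / len(true_labels)


# Hypothetical toy data: two well-separated groups with known classes.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
truth = [0, 0, 0, 1, 1, 1]
labels = kmeans(points, k=2)
print(cluster_error_rate(truth, labels))
```

The same `cluster_error_rate` function could be applied to the output of any other clustering method, which is how a comparison between two algorithms on labelled data might be set up.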