Information theoretic feature selection clustering

Clustering is a core task in data mining, the process of analyzing data from many angles to discover new patterns in large data sets and to find correlations that turn raw information into reliable, usable knowledge. Data mining, however, usually deals with large, high-dimensional data, and most existing algorithms are sensitive to scale, to high dimensionality, or to both. The choice of features plays an important role: some features are the crux of the clustering, while others only obstruct the process. One way to overcome these problems is to select a subset of key features. To further improve clustering accuracy, a non-parametric estimate of the average class entropies can be used to derive a clustering algorithm that maximizes the estimated mutual information between clusters and data points. Several methods have been implemented, such as the k-Means algorithm, which relies on simple distance measures and keeps the computational cost low but lacks accuracy. To address this lack of accuracy while remaining efficient, the Nonparametric Information Clustering (NIC) algorithm is used to divide a set of objects into groups, assigning each data point to a group based on a specific distance from the point to a cluster center. Because it is non-parametric, it is applicable when the input parameters are unknown. NIC is tested against k-Means on different data sets, and the results show that NIC achieves better accuracy. To examine the accuracy further, an error-rate function is implemented to check the correctness of each cluster.
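
The abstract mentions an error-rate function for checking the correctness of each cluster but does not define it. The short Python sketch below shows one common, assumed formulation: each cluster is labeled with its majority ground-truth class, and every point outside that class counts as an error. The Iris data set, the helper name cluster_error_rate, and the use of scikit-learn's KMeans as the baseline are illustrative assumptions, not details taken from the thesis.

    # Hypothetical sketch: a purity-style cluster error rate applied to a k-Means
    # baseline. The exact error function and data sets used in the thesis are not
    # given in this record, so the choices below are assumptions.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    def cluster_error_rate(true_labels, cluster_ids):
        """Fraction of points whose cluster's majority class differs from their true class."""
        true_labels = np.asarray(true_labels)
        cluster_ids = np.asarray(cluster_ids)
        errors = 0
        for c in np.unique(cluster_ids):
            members = true_labels[cluster_ids == c]
            # Every point outside the cluster's dominant class counts as an error.
            errors += members.size - np.bincount(members).max()
        return errors / true_labels.size

    if __name__ == "__main__":
        X, y = load_iris(return_X_y=True)  # stand-in data set
        kmeans_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
        print(f"k-Means error rate: {cluster_error_rate(y, kmeans_ids):.3f}")

The same function could be applied to an NIC partition of the same data, which is the kind of accuracy comparison against k-Means that the abstract describes.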

Bibliographic Details
Main Author: Quan, Yu Teng.
Other Authors: School of Computer Engineering; Manoranjan Dash
Format: Final Year Project (FYP)
Language: English
Published: Nanyang Technological University, 2012
Degree: Bachelor of Engineering (Computer Science)
Physical Description: 28 p.
Subjects: DRNTU::Engineering::Computer science and engineering::Information systems::Database management
Online Access: http://hdl.handle.net/10356/48493