Information theoretic feature selection clustering
Clustering is part of data mining; data mining is the process of analyzing data from multiple angles to discover new patterns in large data sets and to find correlations that turn raw information into reliable, usable knowledge. However, data mining is usually concerned with large, high-dimensional data...
Main Author: | Quan, Yu Teng. |
---|---|
Other Authors: | School of Computer Engineering |
Format: | Final Year Project (FYP) |
Language: | English |
Published: | 2012 |
Subjects: | DRNTU::Engineering::Computer science and engineering::Information systems::Database management |
Online Access: | http://hdl.handle.net/10356/48493 |
_version_ | 1811682730202628096 |
---|---|
author | Quan, Yu Teng. |
author2 | School of Computer Engineering |
author_facet | School of Computer Engineering Quan, Yu Teng. |
author_sort | Quan, Yu Teng. |
collection | NTU |
description | Clustering is part of data mining; data mining is the process of analyzing data from multiple angles to discover new patterns in large data sets and to find correlations that turn raw information into reliable, usable knowledge. However, data mining is usually concerned with large, high-dimensional data, and most of the algorithms researchers have implemented to date are sensitive to scale, to high dimensionality, or to both.
The type of features used plays an important role in data mining: some features are crucial for clustering, while others merely obstruct the process. One way to overcome this problem is to select a subset of key features. To further improve clustering accuracy, a non-parametric estimate of the average class entropies can be used to derive a clustering algorithm that maximizes the estimated mutual information between clusters and data points.
Several methods have been proposed and implemented, such as the k-Means algorithm, which exploits the properties of distance measures to keep computing cost low but lacks accuracy. To counter this lack of accuracy while maintaining efficiency, the Nonparametric Information Clustering (NIC) algorithm is used to divide a set of objects into groups, assigning each data point to a cluster according to a specific distance from the point to the cluster center. Because it is nonparametric, it is applicable in situations where the input parameters are unknown.
NIC is tested against k-Means on different data sets, and the results show that NIC achieves better accuracy. To examine accuracy further, an error rate function will be implemented to check the correctness of each cluster. |
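As a rough illustration of the kind of pipeline the abstract describes (not the thesis's actual NIC implementation, which is not reproduced in this record), the sketch below clusters a sample data set with k-Means, scores the result with a kernel-based plug-in estimate of the mutual information between cluster labels and data points, and computes an error rate by matching clusters to ground-truth classes. The function names, the Gaussian-kernel bandwidth, and the use of scikit-learn's KMeans on the Iris data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris


def kernel_density(X, points, bandwidth=0.5):
    """Average Gaussian-kernel density of each row of `points` under sample X."""
    d2 = ((points[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).mean(axis=1) + 1e-12


def estimated_mutual_information(X, labels, bandwidth=0.5):
    """Plug-in estimate of I(C; X) = H(X) - sum_c p(c) H(X | C=c),
    using kernel density estimates of the marginal and per-cluster densities."""
    h_marginal = -np.log(kernel_density(X, X, bandwidth)).mean()
    h_conditional = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        p_c = len(Xc) / len(X)
        h_conditional += p_c * (-np.log(kernel_density(Xc, Xc, bandwidth)).mean())
    return h_marginal - h_conditional


def clustering_error_rate(true_labels, cluster_labels):
    """Error rate after matching clusters to classes with the Hungarian algorithm."""
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    # cost[i, j] = negative count of points in cluster i that belong to class j
    cost = np.zeros((len(clusters), len(classes)))
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            cost[i, j] = -np.sum((cluster_labels == c) & (true_labels == k))
    rows, cols = linear_sum_assignment(cost)
    correct = -cost[rows, cols].sum()
    return 1.0 - correct / len(true_labels)


if __name__ == "__main__":
    X, y = load_iris(return_X_y=True)
    pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print("estimated I(C; X):", estimated_mutual_information(X, pred))
    print("error rate:", clustering_error_rate(y, pred))
```

A clustering that better separates the data yields a higher estimated mutual information between cluster labels and points, which is the quantity the abstract says the NIC-style approach tries to maximize; the error rate is the accuracy check it mentions.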
first_indexed | 2024-10-01T04:01:29Z |
format | Final Year Project (FYP) |
id | ntu-10356/48493 |
institution | Nanyang Technological University |
language | English |
last_indexed | 2024-10-01T04:01:29Z |
publishDate | 2012 |
record_format | dspace |
spelling | ntu-10356/48493 2023-03-03T20:42:15Z Information theoretic feature selection clustering Quan, Yu Teng. School of Computer Engineering Manoranjan Dash DRNTU::Engineering::Computer science and engineering::Information systems::Database management Bachelor of Engineering (Computer Science) 2012-04-25T00:55:49Z 2012-04-25T00:55:49Z 2012 2012 Final Year Project (FYP) http://hdl.handle.net/10356/48493 en Nanyang Technological University 28 p. application/pdf |
spellingShingle | DRNTU::Engineering::Computer science and engineering::Information systems::Database management Quan, Yu Teng. Information theoretic feature selection clustering |
title | Information theoretic feature selection clustering |
title_full | Information theoretic feature selection clustering |
title_fullStr | Information theoretic feature selection clustering |
title_full_unstemmed | Information theoretic feature selection clustering |
title_short | Information theoretic feature selection clustering |
title_sort | information theoretic feature selection clustering |
topic | DRNTU::Engineering::Computer science and engineering::Information systems::Database management |
url | http://hdl.handle.net/10356/48493 |
work_keys_str_mv | AT quanyuteng informationtheoreticfeatureselectionclustering |