Information theoretic feature selection clustering

Clustering is a core task in data mining, the process of analyzing data from many angles to discover new patterns in large data sets and to find correlations that turn raw information into reliable, usable knowledge. Data mining, however, usually deals with large, high-dimensional data, and most existing algorithms are sensitive to scale, to high dimensionality, or to both. The choice of features plays an important role: some features are the crux of the clustering, while others only obstruct the process. One way to overcome these problems is to select a subset of key features. To further improve clustering accuracy, a non-parametric estimate of the average class entropies can be used to derive a clustering algorithm that maximizes the estimated mutual information between clusters and data points. Several methods have been implemented, such as the k-Means algorithm, which relies on simple distance measures and keeps the computational cost low but lacks accuracy. To address this lack of accuracy while remaining efficient, the Nonparametric Information Clustering (NIC) algorithm is used to divide a set of objects into groups, assigning each data point to a group based on a specific distance from the point to a cluster center. Because it is non-parametric, it is applicable when the input parameters are unknown. NIC is tested against k-Means on different data sets, and the results show that NIC achieves better accuracy. To examine the accuracy further, an error-rate function is implemented to check the correctness of each cluster.
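
The abstract mentions an error-rate function for checking the correctness of each cluster but does not define it. The short Python sketch below shows one common, assumed formulation: each cluster is labeled with its majority ground-truth class, and every point outside that class counts as an error. The Iris data set, the helper name cluster_error_rate, and the use of scikit-learn's KMeans as the baseline are illustrative assumptions, not details taken from the thesis.

    # Hypothetical sketch: a purity-style cluster error rate applied to a k-Means
    # baseline. The exact error function and data sets used in the thesis are not
    # given in this record, so the choices below are assumptions.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    def cluster_error_rate(true_labels, cluster_ids):
        """Fraction of points whose cluster's majority class differs from their true class."""
        true_labels = np.asarray(true_labels)
        cluster_ids = np.asarray(cluster_ids)
        errors = 0
        for c in np.unique(cluster_ids):
            members = true_labels[cluster_ids == c]
            # Every point outside the cluster's dominant class counts as an error.
            errors += members.size - np.bincount(members).max()
        return errors / true_labels.size

    if __name__ == "__main__":
        X, y = load_iris(return_X_y=True)  # stand-in data set
        kmeans_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
        print(f"k-Means error rate: {cluster_error_rate(y, kmeans_ids):.3f}")

The same function could be applied to an NIC partition of the same data, which is the kind of accuracy comparison against k-Means that the abstract describes.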

Bibliographic Details
Main Author: Quan, Yu Teng.
Other Authors: School of Computer Engineering; Manoranjan Dash
Format: Final Year Project (FYP)
Language: English
Published: Nanyang Technological University, 2012
Degree: Bachelor of Engineering (Computer Science)
Physical Description: 28 p.
Subjects: DRNTU::Engineering::Computer science and engineering::Information systems::Database management
Online Access: http://hdl.handle.net/10356/48493