A Two-Phase Approach for Semi-Supervised Feature Selection


Bibliographic Details
Main Authors: Amit Saxena, Shreya Pare, Mahendra Singh Meena, Deepak Gupta, Akshansh Gupta, Imran Razzak, Chin-Teng Lin, Mukesh Prasad
Format: Article
Language: English
Published: MDPI AG 2020-08-01
Series: Algorithms
Subjects: feature selection, semi-supervised datasets, classification, clustering, correlation
Online Access: https://www.mdpi.com/1999-4893/13/9/215
_version_ 1827707175294730240
author Amit Saxena
Shreya Pare
Mahendra Singh Meena
Deepak Gupta
Akshansh Gupta
Imran Razzak
Chin-Teng Lin
Mukesh Prasad
author_sort Amit Saxena
collection DOAJ
description This paper proposes a novel approach for selecting a subset of features in semi-supervised datasets, where only some of the patterns are labeled. The process is completed in two phases. In the first phase (Phase-I), the whole dataset is divided into two parts: the first part contains the labeled patterns, and the second part contains the unlabeled patterns. A small number of features are identified using well-known maximum-relevance (computed on the first part) and minimum-redundancy (computed on the whole dataset) feature selection approaches based on the correlation coefficient. From the identified features, the subset that produces a high classification accuracy on the labeled patterns with any supervised classifier is selected for later processing. In the second phase (Phase-II), the patterns belonging to the first and second parts are clustered separately into as many clusters as the dataset has classes. In each cluster of the first part, the class to which the majority of the cluster's patterns belong, which is already given, is taken as the class of that cluster. The cluster centroids of the first and second parts are then paired: the centroid of the second part nearest to a centroid of the first part is paired with it. As the class of the first centroid is known, the same class can be assigned to the paired cluster of the second part, whose class is unknown. If the actual classes of the patterns in the second part are known, they can be used to test the classification accuracy on the second part. The proposed two-phase approach performs well in terms of classification accuracy and number of features selected on the given benchmark datasets.
first_indexed 2024-03-10T16:40:55Z
format Article
id doaj.art-8525cef5828c472086dafc61ba4f893e
institution Directory Open Access Journal
issn 1999-4893
language English
last_indexed 2024-03-10T16:40:55Z
publishDate 2020-08-01
publisher MDPI AG
record_format Article
series Algorithms
spelling doaj.art-8525cef5828c472086dafc61ba4f893e
2023-11-20T12:04:29Z
eng
MDPI AG
Algorithms
1999-4893
2020-08-01
volume 13, issue 9, article 215
10.3390/a13090215
A Two-Phase Approach for Semi-Supervised Feature Selection
Amit Saxena (Department of Computer Science and Information Technology, Guru Ghasidas University, Bilaspur, Chhattisgarh 495009, India)
Shreya Pare (School of Computer Science, FEIT, University of Technology Sydney, Sydney, NSW 2007, Australia)
Mahendra Singh Meena (School of Computer Science, FEIT, University of Technology Sydney, Sydney, NSW 2007, Australia)
Deepak Gupta (Department of Computer Science & Engineering, National Institute of Technology Arunachal Pradesh, Yupia 791112, India)
Akshansh Gupta (Central Electronics Engineering Research Institute, Delhi 110028, India)
Imran Razzak (School of Information Technology, Deakin University, Geelong, VIC 3217, Australia)
Chin-Teng Lin (School of Computer Science, FEIT, University of Technology Sydney, Sydney, NSW 2007, Australia)
Mukesh Prasad (School of Computer Science, FEIT, University of Technology Sydney, Sydney, NSW 2007, Australia)
https://www.mdpi.com/1999-4893/13/9/215
feature selection
semi-supervised datasets
classification
clustering
correlation
title A Two-Phase Approach for Semi-Supervised Feature Selection
title_sort two phase approach for semi supervised feature selection
topic feature selection
semi-supervised datasets
classification
clustering
correlation
url https://www.mdpi.com/1999-4893/13/9/215
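
The two phases summarized in the description can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and every name in it (`abs_corr`, `select_features`, `kmeans`, `propagate_labels`) is hypothetical. It assumes numeric class labels, a greedy relevance-minus-redundancy score for Phase-I, plain k-means for Phase-II, and a nearest-centroid pairing that is not forced to be one-to-one. The classifier-based choice of the final feature subset mentioned in the description is omitted.

```python
import numpy as np


def abs_corr(a, b):
    """Absolute Pearson correlation; 0.0 when either input is constant."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return 0.0 if denom == 0 else float(abs((a * b).sum() / denom))


def select_features(X_lab, y_lab, X_all, k):
    """Phase-I sketch: greedy max-relevance, min-redundancy ranking."""
    n_features = X_all.shape[1]
    k = min(k, n_features)
    # Relevance is computed on the labeled part only.
    relevance = np.array([abs_corr(X_lab[:, j], y_lab) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = -1, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy is computed on the whole dataset (labeled + unlabeled).
            redundancy = np.mean([abs_corr(X_all[:, j], X_all[:, s]) for s in selected])
            score = relevance[j] - redundancy  # assumed scoring rule
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, cluster index per row)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = X[assign == c].mean(axis=0)
    # Recompute assignments so they match the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


def propagate_labels(X_lab, y_lab, X_unl, n_classes, seed=0):
    """Phase-II sketch: cluster both parts, then copy classes via centroids."""
    cent_lab, assign_lab = kmeans(X_lab, n_classes, seed=seed)
    cent_unl, assign_unl = kmeans(X_unl, n_classes, seed=seed)
    # Majority class of each labeled cluster (overall majority if a cluster is empty).
    values, counts = np.unique(y_lab, return_counts=True)
    cluster_class = np.full(n_classes, values[counts.argmax()], dtype=y_lab.dtype)
    for c in range(n_classes):
        members = y_lab[assign_lab == c]
        if members.size:
            vals, cnts = np.unique(members, return_counts=True)
            cluster_class[c] = vals[cnts.argmax()]
    # Pair each unlabeled centroid with its nearest labeled centroid.
    d = np.linalg.norm(cent_unl[:, None, :] - cent_lab[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    # Every unlabeled pattern inherits the class of its cluster's paired centroid.
    return cluster_class[nearest][assign_unl]
```

A typical call would first run `select_features` on the labeled rows and the full matrix, then pass only the selected columns of both parts to `propagate_labels`. If the true classes of the second part are available, comparing them with the returned array gives the classification accuracy the description refers to.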