Two new feature selection metrics for text classification

Obtaining meaningful information from data has become a central problem, and data mining techniques have gained importance as a result. Text classification is one of the most commonly studied areas of data mining. The main problem in text classification is the increase in the required time and a decrease in...

Bibliographic Details
Main Authors: Durmuş Özkan Şahin, Erdal Kılıç
Format: Article
Language: English
Published: Taylor & Francis Group 2019-04-01
Series: Automatika
Subjects: Text classification, text mining, feature selection, term selection
Online Access: http://dx.doi.org/10.1080/00051144.2019.1602293
author Durmuş Özkan Şahin
Erdal Kılıç
author_facet Durmuş Özkan Şahin
Erdal Kılıç
author_sort Durmuş Özkan Şahin
collection DOAJ
description Obtaining meaningful information from data has become a central problem, and data mining techniques have gained importance as a result. Text classification is one of the most commonly studied areas of data mining. The main problem in text classification is that large data sizes increase the required time and decrease classification success. The main purpose of this study is to determine the right feature selection methods for text classification. Frequently used feature selection metrics such as Chi-square and Information Gain were applied to different data sets, and their performance was measured. Two filter-based feature selection metrics are proposed as alternatives to the existing ones. The first is the Relevance Frequency Feature Selection metric, obtained by adding new parameters to the Relevance Frequency method used for term weighting in text classification. The second is the Alternative Accuracy2 metric, obtained by changing the parameters of the Accuracy2 metric. The proposed Relevance Frequency Feature Selection and Alternative Accuracy2 metrics were observed to be as successful as the frequently used existing metrics.
first_indexed 2024-12-20T03:46:48Z
format Article
id doaj.art-b78179acaab345b584bef551ca0bdbd7
institution Directory Open Access Journal
issn 0005-1144
1848-3380
language English
last_indexed 2024-12-20T03:46:48Z
publishDate 2019-04-01
publisher Taylor & Francis Group
record_format Article
series Automatika
spelling doaj.art-b78179acaab345b584bef551ca0bdbd7 | 2022-12-21T19:54:35Z | eng | Taylor & Francis Group | Automatika | 0005-1144 | 1848-3380 | 2019-04-01 | 60 | 2 | 162 | 171 | 10.1080/00051144.2019.1602293 | 1602293 | Two new feature selection metrics for text classification | Durmuş Özkan Şahin (Ondokuz Mayıs University) | Erdal Kılıç (Ondokuz Mayıs University) | http://dx.doi.org/10.1080/00051144.2019.1602293 | Text classification | text mining | feature selection | term selection
title Two new feature selection metrics for text classification
topic Text classification
text mining
feature selection
term selection
url http://dx.doi.org/10.1080/00051144.2019.1602293