Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification

Text classification is the process of assigning documents to a predefined set of categories based on their content. Text classification algorithms typically represent documents as collections of words and therefore deal with a large number of features. The selection of appropriate features becomes im...


Bibliographic Details
Main Authors: Demeke Endalie, Getamesay Haile, Wondmagegn Taye Abebe
Format: Article
Language: English
Published: PeerJ Inc. 2022-04-01
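Series: PeerJ Computer Science
Subjects: Chi-square, Document frequency, Extra tree classifier, Feature selection, Genetic algorithm, Information gain
Online Access: https://peerj.com/articles/cs-961.pdf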
author Demeke Endalie
Getamesay Haile
Wondmagegn Taye Abebe
collection DOAJ
description Text classification is the process of assigning documents to a predefined set of categories based on their content. Text classification algorithms typically represent documents as collections of words and therefore deal with a large number of features. Selecting appropriate features becomes important when the initial feature set is large. In this paper, we present a feature selection method for Amharic text classification that combines document frequency (DF) with a genetic algorithm (GA). We evaluate the method on Amharic news documents obtained from the Ethiopian News Agency (ENA), covering 13 categories. Our experimental results show that the proposed method outperforms other feature selection methods used for Amharic news document classification. Combining the proposed method with an Extra Tree Classifier (ETC) improves classification accuracy by up to 1% over a hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), 2.47% over GA, and 3.86% over a hybrid of DF, IG, and CHI.
first_indexed 2024-04-13T03:35:32Z
format Article
id doaj.art-02efaf1af41c4dc9b97563d255700993
institution Directory Open Access Journal
issn 2376-5992
language English
last_indexed 2024-04-13T03:35:32Z
publishDate 2022-04-01
publisher PeerJ Inc.
record_format Article
series PeerJ Computer Science
spelling doaj.art-02efaf1af41c4dc9b97563d255700993; 2022-12-22T03:04:20Z; eng; PeerJ Inc.; PeerJ Computer Science; ISSN 2376-5992; published 2022-04-01; volume 8; article e961; DOI 10.7717/peerj-cs.961
Demeke Endalie (Faculty of Computing and Informatics, Jimma Institute of Technology, Jimma, Oromia, Ethiopia)
Getamesay Haile (Faculty of Computing and Informatics, Jimma Institute of Technology, Jimma, Oromia, Ethiopia)
Wondmagegn Taye Abebe (Faculty of Civil and Environmental Engineering, Jimma Institute of Technology, Jimma, Oromia, Ethiopia)
title Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification
topic Chi-square
Document frequency
Extra tree classifier
Feature selection
Genetic algorithm
Information gain
url https://peerj.com/articles/cs-961.pdf
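
The description field outlines a two-stage feature selection: a document frequency (DF) filter followed by a genetic algorithm (GA) search. The following is a minimal sketch of that general idea, not the authors' implementation. This record does not give the paper's DF threshold, GA operators, or fitness function, so `min_df`, the truncation selection, the one-point crossover, the mutation rate, and `toy_fitness` are all assumptions chosen for illustration; in practice the fitness would more plausibly be a classifier's validation accuracy (for example from an Extra Trees model).

```python
import random
from collections import Counter


def document_frequency_filter(docs, min_df=2):
    """Stage 1: keep terms that occur in at least min_df documents (min_df is an assumption)."""
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(term for term, count in df.items() if count >= min_df)


def genetic_select(features, fitness, pop_size=20, generations=30,
                   mutation_rate=0.05, seed=0):
    """Stage 2: search over bit masks for a feature subset that scores well under fitness."""
    rng = random.Random(seed)
    n = len(features)
    if n < 2:                      # nothing to search over
        return list(features)

    def subset(mask):
        return [f for f, bit in zip(features, mask) if bit]

    def score(mask):
        return fitness(subset(mask))

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: pop_size // 2]       # truncation selection (assumed operator)
        children = [ranked[0][:]]               # elitism: carry the best mask forward
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)           # one-point crossover (assumed operator)
            child = a[:cut] + b[cut:]
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = children
        best = max(population + [best], key=score)
    return subset(best)


# Toy corpus: each document is a list of tokens (hypothetical English stand-ins).
docs = [
    ["sport", "goal", "team"],
    ["sport", "team", "coach"],
    ["market", "bank", "price"],
    ["market", "price", "goal"],
]
labels = ["sport", "sport", "economy", "economy"]


def toy_fitness(selected):
    """Hypothetical stand-in for classifier accuracy: reward terms seen in one class only."""
    total = 0.0
    for term in selected:
        classes = {lab for doc, lab in zip(docs, labels) if term in doc}
        total += 1.0 if len(classes) == 1 else -1.0
    return total


candidates = document_frequency_filter(docs, min_df=2)
selected = genetic_select(candidates, toy_fitness)
print(candidates)
print(selected)
```

Running the DF filter first shrinks the search space the GA has to explore, which is the practical motivation for combining the two stages; swapping `toy_fitness` for a cross-validated classifier score would bring the sketch closer to the setup the abstract describes.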