Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification
Text classification is the process of categorizing documents based on their content into a predefined set of categories. Text classification algorithms typically represent documents as collections of words and therefore deal with a large number of features. The selection of appropriate features becomes im...
Main Authors: | Demeke Endalie, Getamesay Haile, Wondmagegn Taye Abebe |
---|---|
Format: | Article |
Language: | English |
Published: | PeerJ Inc. 2022-04-01 |
Series: | PeerJ Computer Science |
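Subjects: | Chi-square, Document frequency, Extra tree classifier, Feature selection, Genetic algorithm, Information gain |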
Online Access: | https://peerj.com/articles/cs-961.pdf |
_version_ | 1828261197460275200 |
---|---|
author | Demeke Endalie; Getamesay Haile; Wondmagegn Taye Abebe |
author_sort | Demeke Endalie |
collection | DOAJ |
description | Text classification is the process of categorizing documents based on their content into a predefined set of categories. Text classification algorithms typically represent documents as collections of words and therefore deal with a large number of features. The selection of appropriate features becomes important when the initial feature set is quite large. In this paper, we present a feature selection method for Amharic text classification that combines document frequency (DF) with a genetic algorithm (GA). We evaluate this feature selection method on Amharic news documents obtained from the Ethiopian News Agency (ENA). The study uses 13 categories. Our experimental results show that the proposed method outperforms other feature selection methods used for Amharic news document classification. Combined with an Extra Tree Classifier (ETC), the proposed method achieves classification accuracy up to 1% higher than a hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), 2.47% higher than GA, and 3.86% higher than a hybrid of DF, IG, and CHI. |
first_indexed | 2024-04-13T03:35:32Z |
format | Article |
id | doaj.art-02efaf1af41c4dc9b97563d255700993 |
institution | Directory of Open Access Journals |
issn | 2376-5992 |
language | English |
last_indexed | 2024-04-13T03:35:32Z |
publishDate | 2022-04-01 |
publisher | PeerJ Inc. |
record_format | Article |
series | PeerJ Computer Science |
spelling | doaj.art-02efaf1af41c4dc9b97563d255700993; 2022-12-22T03:04:20Z; eng; PeerJ Inc.; PeerJ Computer Science; ISSN 2376-5992; 2022-04-01; vol. 8, e961; DOI 10.7717/peerj-cs.961; Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification; Demeke Endalie (Faculty of Computing and Informatics, Jimma Institute of Technology, Jimma, Oromia, Ethiopia); Getamesay Haile (Faculty of Computing and Informatics, Jimma Institute of Technology, Jimma, Oromia, Ethiopia); Wondmagegn Taye Abebe (Faculty of Civil and Environmental Engineering, Jimma Institute of Technology, Jimma, Oromia, Ethiopia); https://peerj.com/articles/cs-961.pdf; Chi-square, Document frequency, Extra tree classifier, Feature selection, Genetic algorithm, Information gain |
title | Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification |
topic | Chi-square; Document frequency; Extra tree classifier; Feature selection; Genetic algorithm; Information gain |
url | https://peerj.com/articles/cs-961.pdf |
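
The description in this record outlines a two-stage idea: filter terms by document frequency (DF), then let a genetic algorithm (GA) search for a good subset of the surviving terms. The following is a minimal sketch of that general pattern, not the authors' implementation. Everything in it is an assumption made for illustration: the toy English corpus, the `min_df` threshold, the GA settings, the helper names (`df_filter`, `vectorize`, `loo_accuracy`, `genetic_select`), and the fitness function. In particular, fitness here is leave-one-out 1-nearest-neighbour accuracy with a small penalty on subset size, swapped in so the example needs only the standard library; the paper reports results with an Extra Tree Classifier.

```python
import random
from collections import Counter

def df_filter(docs, min_df=2):
    """Keep terms that occur in at least min_df documents (sorted for determinism)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return sorted(term for term, n in df.items() if n >= min_df)

def vectorize(docs, vocab):
    """Binary bag-of-words vectors over vocab."""
    vectors = []
    for tokens in docs:
        present = set(tokens)
        vectors.append([int(term in present) for term in vocab])
    return vectors

def loo_accuracy(X, y, mask):
    """Leave-one-out 1-NN accuracy using only the features where mask is 1."""
    idx = [j for j, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, xi in enumerate(X):
        best_label, best_dist = None, None
        for k, xk in enumerate(X):
            if k == i:
                continue
            dist = sum(xi[j] != xk[j] for j in idx)  # Hamming distance
            if best_dist is None or dist < best_dist:
                best_label, best_dist = y[k], dist
        correct += int(best_label == y[i])
    return correct / len(X)

def genetic_select(X, y, n_generations=20, pop_size=12, mutation_rate=0.1, seed=0):
    """Search for a 0/1 feature mask with a simple elitist GA (assumes >= 2 features)."""
    rng = random.Random(seed)
    n = len(X[0])

    def fitness(mask):
        # Accuracy minus a small penalty that favours smaller subsets.
        return loo_accuracy(X, y, mask) - 0.01 * sum(mask) / n

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(n_generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Hypothetical toy corpus standing in for tokenized news documents.
docs = [
    ["team", "wins", "match", "today"],
    ["team", "loses", "match", "today"],
    ["coach", "team", "match"],
    ["bank", "raises", "rates", "today"],
    ["bank", "cuts", "rates"],
    ["market", "bank", "rates", "today"],
]
labels = ["sport", "sport", "sport", "economy", "economy", "economy"]

vocab = df_filter(docs, min_df=2)   # stage 1: DF filter
X = vectorize(docs, vocab)
mask = genetic_select(X, labels)    # stage 2: GA search over the filtered terms
selected = [term for term, bit in zip(vocab, mask) if bit]
print("DF-filtered vocabulary:", vocab)
print("GA-selected features:", selected)
```

The DF stage shrinks the search space cheaply before the GA runs, which matters because each fitness evaluation trains or scores a classifier. Replacing `loo_accuracy` with cross-validated accuracy from a stronger classifier would bring the sketch closer to the setup the description reports.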