Modified Genetic Algorithm for Feature Selection and Hyper Parameter Optimization: Case of XGBoost in Spam Prediction

Recently, spam on online social networks has attracted attention in both research and business. Twitter has become the preferred medium for spreading spam content. Many research efforts have attempted to counter social network spam. Twitter brings extra challenges in the form of the feature space si...


Bibliographic Details
Main Authors: Nazeeh Ghatasheh, Ismail Altaharwa, Khaled Aldebei
Format: Article
Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
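Subjects: Genetic algorithm; business analytics; extreme gradient boosting; feature selection; hyper parameter optimization; spam prediction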
Online Access: https://ieeexplore.ieee.org/document/9851666/
author Nazeeh Ghatasheh
Ismail Altaharwa
Khaled Aldebei
author_sort Nazeeh Ghatasheh
collection DOAJ
description Recently, spam on online social networks has attracted attention in both research and business. Twitter has become the preferred medium for spreading spam content. Many research efforts have attempted to counter social network spam. Twitter brings extra challenges in the form of a large feature space and imbalanced data distributions. Related work usually addresses only some of these challenges or produces black-box models. In this paper, we propose a modified genetic algorithm for simultaneous dimensionality reduction and hyper parameter optimization over imbalanced datasets. The algorithm initializes an eXtreme Gradient Boosting classifier and reduces the feature space of a tweets dataset to generate a spam prediction model. The model is validated using 10-fold stratified cross-validation repeated 50 times and analyzed using nonparametric statistical tests. The resulting prediction model attains on average 82.32% geometric mean and 92.67% accuracy, using less than 10% of the total feature space. The empirical results show that the modified genetic algorithm outperforms the $\chi^{2}$ and PCA feature selection methods. In addition, eXtreme Gradient Boosting outperforms many machine learning algorithms, including a BERT-based deep learning model, in spam prediction. Furthermore, the proposed approach is applied to SMS spam modeling and compared with related work.
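The description above combines two ideas that a short sketch may make more concrete: a chromosome that carries a feature mask together with hyper parameter genes, and the geometric mean as a score for imbalanced data. The Python below is a generic genetic algorithm with elitism, not the authors' modified algorithm, whose details are not given in this record. The names `Chromosome`, `g_mean`, and `evolve`, the two hyper parameter genes (`max_depth`, `learning_rate`), their ranges, and the mutation rate are all assumptions made for illustration. The `fitness` argument is a placeholder for whatever scoring one chooses, for example a cross-validated geometric mean of a classifier trained on the masked features.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


def g_mean(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Geometric mean of sensitivity and specificity (1 = spam, 0 = not spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        return 0.0  # undefined without both classes; treated as worst case here
    return math.sqrt((tp / pos) * (tn / neg))


@dataclass
class Chromosome:
    mask: List[int]        # 1 keeps a feature, 0 drops it
    max_depth: int         # illustrative hyper parameter gene (assumed range 2..10)
    learning_rate: float   # illustrative hyper parameter gene (assumed range 0.01..0.3)


def random_chromosome(n_features: int, rng: random.Random) -> Chromosome:
    return Chromosome(
        mask=[rng.randint(0, 1) for _ in range(n_features)],
        max_depth=rng.randint(2, 10),
        learning_rate=rng.uniform(0.01, 0.3),
    )


def crossover(a: Chromosome, b: Chromosome, rng: random.Random) -> Chromosome:
    # One-point crossover on the mask (needs at least 2 features);
    # each hyper parameter gene is inherited from one parent at random.
    cut = rng.randrange(1, len(a.mask))
    return Chromosome(
        mask=a.mask[:cut] + b.mask[cut:],
        max_depth=rng.choice([a.max_depth, b.max_depth]),
        learning_rate=rng.choice([a.learning_rate, b.learning_rate]),
    )


def mutate(c: Chromosome, rng: random.Random, rate: float = 0.05) -> Chromosome:
    # Flip each mask bit with probability `rate`; resample each
    # hyper parameter gene with the same probability.
    mask = [1 - g if rng.random() < rate else g for g in c.mask]
    depth = rng.randint(2, 10) if rng.random() < rate else c.max_depth
    lr = rng.uniform(0.01, 0.3) if rng.random() < rate else c.learning_rate
    return Chromosome(mask, depth, lr)


def evolve(
    fitness: Callable[[Chromosome], float],
    n_features: int,
    pop_size: int = 20,
    generations: int = 30,
    seed: int = 0,
) -> Chromosome:
    """Keep the better half each generation and refill with mutated offspring."""
    rng = random.Random(seed)
    pop = [random_chromosome(n_features, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]
        children = [
            mutate(crossover(*rng.sample(elite, 2), rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        pop = elite + children
    return max(pop, key=fitness)
```

Because one chromosome holds both the mask and the hyper parameter genes, a single fitness evaluation scores a feature subset and a classifier configuration together, which is the sense in which the two searches are simultaneous. The metric choice matters on imbalanced data: with one spam item among ten, predicting "not spam" for everything is 90% accurate, yet `g_mean([1] + [0] * 9, [0] * 10)` is 0.0 because sensitivity is zero.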
first_indexed 2024-04-13T12:50:24Z
format Article
id doaj.art-9cb56f6c1a6f4cabbb12b5fae0e2bca2
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-04-13T12:50:24Z
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-9cb56f6c1a6f4cabbb12b5fae0e2bca2
2022-12-22T02:46:15Z
eng
IEEE
IEEE Access
2169-3536
2022-01-01
volume 10, pages 84365-84383
doi 10.1109/ACCESS.2022.3196905
9851666
Modified Genetic Algorithm for Feature Selection and Hyper Parameter Optimization: Case of XGBoost in Spam Prediction
Nazeeh Ghatasheh, https://orcid.org/0000-0002-8000-0910, Department of Information Technology, The University of Jordan, Aqaba, Jordan
Ismail Altaharwa, https://orcid.org/0000-0001-8775-0581, Department of Computer Information Systems, The University of Jordan, Aqaba, Jordan
Khaled Aldebei, https://orcid.org/0000-0001-6385-1134, Department of Information Technology, The University of Jordan, Aqaba, Jordan
https://ieeexplore.ieee.org/document/9851666/
Genetic algorithm
business analytics
extreme gradient boosting
feature selection
hyper parameter optimization
spam prediction
title Modified Genetic Algorithm for Feature Selection and Hyper Parameter Optimization: Case of XGBoost in Spam Prediction
topic Genetic algorithm
business analytics
extreme gradient boosting
feature selection
hyper parameter optimization
spam prediction
url https://ieeexplore.ieee.org/document/9851666/