A Comparative Analysis of Machine Learning Methods for Class Imbalance in a Smoking Cessation Intervention

Smoking is one of the major public health issues, which has a significant impact on premature death. In recent years, numerous decision support systems have been developed to deal with smoking cessation based on machine learning methods. However, the inevitable class imbalance is considered a major...

Bibliographic Details
Main Authors: Khishigsuren Davagdorj, Jong Seol Lee, Van Huy Pham, Keun Ho Ryu
Format: Article
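Language: English
Published: MDPI AG 2020-05-01
Series: Applied Sciences
Subjects: smoking; class imbalance; synthetic oversampling; machine learning; decision making; feature importance
Online Access: https://www.mdpi.com/2076-3417/10/9/3307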
author Khishigsuren Davagdorj
Jong Seol Lee
Van Huy Pham
Keun Ho Ryu
collection DOAJ
description Smoking is a major public health issue and a significant contributor to premature death. In recent years, numerous decision support systems based on machine learning methods have been developed to support smoking cessation. However, the inevitable class imbalance is considered a major challenge in deploying such systems. In this paper, we present an empirical comparison of machine learning techniques for dealing with the class imbalance problem in the prediction of smoking cessation intervention among the Korean population. The objective is to improve prediction performance by using synthetic oversampling techniques, namely the synthetic minority over-sampling technique (SMOTE) and adaptive synthetic (ADASYN) sampling. The experimental design comprises three components. First, the most representative features are selected in two phases: the lasso method and multicollinearity analysis. Second, newly balanced data are generated using the SMOTE and ADASYN techniques. Third, machine learning classifiers are applied to construct prediction models for all subjects and for each gender. To evaluate the effectiveness of the prediction models, the F-score, type I error, type II error, balanced accuracy, and geometric mean are used. Comprehensive analysis demonstrates that the Gradient Boosting Trees (GBT), Random Forest (RF), and multilayer perceptron neural network (MLP) classifiers achieved the best performance for all subjects and for each gender when SMOTE and ADASYN were used. The GBT and RF models trained with SMOTE also provide feature importance scores that enhance the interpretability of the decision support system. In addition, the results show that the presented synthetic oversampling techniques combined with machine learning models outperformed the baseline models in smoking cessation prediction.
first_indexed 2024-03-10T19:56:23Z
format Article
id doaj.art-64a77ac01c0241a0941163da8e1acf80
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-10T19:56:23Z
publishDate 2020-05-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-64a77ac01c0241a0941163da8e1acf80 2023-11-19T23:54:21Z eng
MDPI AG, Applied Sciences, ISSN 2076-3417, 2020-05-01, vol. 10, no. 9, article 3307, DOI 10.3390/app10093307
Khishigsuren Davagdorj: Database and Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea
Jong Seol Lee: Database and Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea
Van Huy Pham: Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam
Keun Ho Ryu: Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam
title A Comparative Analysis of Machine Learning Methods for Class Imbalance in a Smoking Cessation Intervention
topic smoking
class imbalance
synthetic oversampling
machine learning
decision making
feature importance
url https://www.mdpi.com/2076-3417/10/9/3307