A Comparative Analysis of Machine Learning Methods for Class Imbalance in a Smoking Cessation Intervention
Smoking is a major public health issue with a significant impact on premature death. In recent years, numerous decision support systems based on machine learning methods have been developed to support smoking cessation. However, the inevitable class imbalance is considered a major...
Main Authors: | Khishigsuren Davagdorj, Jong Seol Lee, Van Huy Pham, Keun Ho Ryu |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2020-05-01 |
Series: | Applied Sciences |
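Subjects: | smoking; class imbalance; synthetic oversampling; machine learning; decision making; feature importance |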
Online Access: | https://www.mdpi.com/2076-3417/10/9/3307 |
_version_ | 1797568474073006080 |
---|---|
author | Khishigsuren Davagdorj Jong Seol Lee Van Huy Pham Keun Ho Ryu |
author_facet | Khishigsuren Davagdorj Jong Seol Lee Van Huy Pham Keun Ho Ryu |
author_sort | Khishigsuren Davagdorj |
collection | DOAJ |
description | Smoking is a major public health issue with a significant impact on premature death. In recent years, numerous decision support systems based on machine learning methods have been developed to support smoking cessation. However, the inevitable class imbalance is considered a major challenge in deploying such systems. In this paper, we present an empirical comparison of machine learning techniques for dealing with the class imbalance problem in the prediction of smoking cessation intervention among the Korean population. To address class imbalance, the objective of this paper is to improve prediction performance by using two synthetic oversampling techniques: the synthetic minority over-sampling technique (SMOTE) and adaptive synthetic sampling (ADASYN). The experimental design comprises three components. First, the best representative features are selected in two phases: the lasso method and multicollinearity analysis. Second, newly balanced data are generated using the SMOTE and ADASYN techniques. Third, machine learning classifiers are applied to construct prediction models for all subjects and for each gender. To assess the effectiveness of the prediction models, the f-score, type I error, type II error, balanced accuracy and geometric mean indices are used. Comprehensive analysis demonstrates that the Gradient Boosting Trees (GBT), Random Forest (RF) and multilayer perceptron neural network (MLP) classifiers achieved the best performance for all subjects and for each gender when SMOTE and ADASYN were utilized. The SMOTE with GBT and RF models also provide feature importance scores that enhance the interpretability of the decision-support system. In addition, the results show that the presented synthetic oversampling techniques combined with machine learning models outperformed baseline models in smoking cessation prediction. |
first_indexed | 2024-03-10T19:56:23Z |
format | Article |
id | doaj.art-64a77ac01c0241a0941163da8e1acf80 |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-10T19:56:23Z |
publishDate | 2020-05-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-64a77ac01c0241a0941163da8e1acf80; 2023-11-19T23:54:21Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2020-05-01; 10; 9; 3307; 10.3390/app10093307; A Comparative Analysis of Machine Learning Methods for Class Imbalance in a Smoking Cessation Intervention; Khishigsuren Davagdorj (Database and Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea); Jong Seol Lee (Database and Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea); Van Huy Pham (Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam); Keun Ho Ryu (Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam); https://www.mdpi.com/2076-3417/10/9/3307; smoking; class imbalance; synthetic oversampling; machine learning; decision making; feature importance |
spellingShingle | Khishigsuren Davagdorj Jong Seol Lee Van Huy Pham Keun Ho Ryu A Comparative Analysis of Machine Learning Methods for Class Imbalance in a Smoking Cessation Intervention Applied Sciences smoking class imbalance synthetic oversampling machine learning decision making feature importance |
title | A Comparative Analysis of Machine Learning Methods for Class Imbalance in a Smoking Cessation Intervention |
title_full | A Comparative Analysis of Machine Learning Methods for Class Imbalance in a Smoking Cessation Intervention |
title_fullStr | A Comparative Analysis of Machine Learning Methods for Class Imbalance in a Smoking Cessation Intervention |
title_full_unstemmed | A Comparative Analysis of Machine Learning Methods for Class Imbalance in a Smoking Cessation Intervention |
title_short | A Comparative Analysis of Machine Learning Methods for Class Imbalance in a Smoking Cessation Intervention |
title_sort | comparative analysis of machine learning methods for class imbalance in a smoking cessation intervention |
topic | smoking class imbalance synthetic oversampling machine learning decision making feature importance |
url | https://www.mdpi.com/2076-3417/10/9/3307 |
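
The description field names SMOTE-based oversampling and five evaluation indices without showing how they are computed. The following is a minimal Python sketch of both ideas, not the authors' implementation. The function names `smote_like_oversample` and `imbalance_metrics`, the parameter `k`, and the toy data are hypothetical. The sketch also assumes that type I error means the false positive rate and type II error means the false negative rate, and that both classes appear in the labels; the record itself does not state these definitions.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points, SMOTE-style: pick a minority
    sample, pick one of its k nearest minority neighbours, and interpolate
    at a random position on the segment between them.
    Assumes `minority` holds at least two points (tuples of floats)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of `base`, excluding `base` itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def imbalance_metrics(y_true, y_pred, positive=1):
    """Compute the indices named in the description from binary labels.
    Assumed definitions: type I error = FP / (FP + TN),
    type II error = FN / (FN + TP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f_score = 2 * precision * sensitivity / denom if denom else 0.0

    return {
        "f_score": f_score,
        "type_i_error": fp / (fp + tn),
        "type_ii_error": fn / (fn + tp),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "g_mean": math.sqrt(sensitivity * specificity),
    }


if __name__ == "__main__":
    # Toy minority class: four points on the unit square (made-up data).
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    print(smote_like_oversample(minority, n_new=3, k=2, seed=1))

    # Toy predictions on an imbalanced label set (made-up data).
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    print(imbalance_metrics(y_true, y_pred))
```

Balanced accuracy and the geometric mean both combine sensitivity and specificity, so a classifier that simply predicts the majority class scores poorly on them even when its plain accuracy is high. In practice, synthetic samples would normally be generated from the training split only, so that no interpolated point leaks into the evaluation data.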