Correlation-based feature selection and Smote-Tomek Link to improve the performance of machine learning methods on cancer disease prediction

Indonesia is an archipelago of 283 million people, the fourth largest population in the world. In Indonesia, breast cancer ranks first among cancers and is the highest contributor to death. Deaths caused by breast cancer can be reduced through screening and early detection to avoid the risk...

Bibliographic Details
Main Authors: Lalu Ganda Rady Putra, Khairani Marzuki, Hairani Hairani
Format: Article
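Language: English
Published: Khon Kaen University 2023-11-01
Series: Engineering and Applied Science Research
Subjects: breast cancer prediction; feature selection correlation; machine learning methods; hybrid smote-tomek link
Online Access: https://ph01.tci-thaijo.org/index.php/easr/article/view/253528/171428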
collection DOAJ
description Indonesia is an archipelago of 283 million people, the fourth largest population in the world. In Indonesia, breast cancer ranks first among cancers and is the highest contributor to death. Deaths caused by breast cancer can be reduced through screening and early detection, which help avoid progression to more severe cancer. Early detection can delay the growth of cancer cells and increase the chances of recovery. This research proposed a machine learning-based application that lets users screen for and detect breast cancer early on their own, based on perceived symptoms. Such an application requires very high accuracy to minimize prediction errors, so this research focused on finding the machine learning configuration with the best accuracy and a very low error rate. The aim was to improve the performance of classification methods in breast cancer prediction using correlation feature selection and Smote-Tomek Link hybrid sampling. Support Vector Machine (SVM) and Naive Bayes classifiers were each combined with both techniques. Smote-Tomek Link hybrid sampling balanced the data while minimizing noise in the generated samples. Correlation feature selection kept the input attributes that were relevant to the class attribute, meaning those with a strong correlation (≥ 0.6) with the class. SVM with hybrid sampling and correlation feature selection achieved the best performance, ahead of both Naive Bayes and the previous research referred to, with an accuracy of 96.80%, sensitivity of 96.80%, and specificity of 96.80%. Thus, the Smote-Tomek Link hybrid sampling approach and correlation feature selection improved the performance of the SVM and Naive Bayes methods for breast cancer prediction.
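The three building blocks named in the description can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. Several things are assumptions here: the use of Pearson correlation, applying the 0.6 threshold to the absolute value, the toy data, and all function and column names (`select_features`, `tomek_links`, `scores`, `lump_size`, `noise`). Only the Tomek-link detection half of Smote-Tomek Link is shown; the SMOTE oversampling step is omitted.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:  # a constant column carries no correlation
        return 0.0
    return cov / (sx * sy)

def select_features(columns, labels, threshold=0.6):
    """Keep features whose |correlation| with the class is >= threshold.
    Taking the absolute value is an assumption, not stated in the source."""
    return [name for name, values in columns.items()
            if abs(pearson(values, labels)) >= threshold]

def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours
    and belong to different classes (a Tomek link)."""
    def nearest(i):
        others = (j for j in range(len(points)) if j != i)
        return min(others, key=lambda j: math.dist(points[i], points[j]))
    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and labels[i] != labels[j] and nearest(j) == i:
            links.append((i, j))
    return links

def scores(y_true, y_pred):
    """Accuracy, sensitivity (recall on class 1), specificity (recall on class 0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return (tp + tn) / len(pairs), tp / (tp + fn), tn / (tn + fp)

# Toy data, invented for illustration only.
columns = {"lump_size": [1, 2, 3, 4, 5, 6], "noise": [3, 1, 2, 3, 1, 2]}
labels = [0, 0, 0, 1, 1, 1]
selected = select_features(columns, labels)

links = tomek_links([(0.0,), (1.0,), (1.2,), (3.0,)], [0, 0, 1, 1])

accuracy, sensitivity, specificity = scores([1, 1, 1, 0, 0, 0],
                                            [1, 1, 0, 0, 0, 1])
```

In a hybrid sampling pipeline, the points taking part in each detected link would be dropped after oversampling, which is the sense in which the description speaks of minimizing noise in the generated data.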
first_indexed 2024-03-11T11:51:42Z
id doaj.art-37b66ebbd5a54d309eb46acac92f0ddd
institution Directory Open Access Journal
issn 2539-6161
2539-6218
last_indexed 2024-03-11T11:51:42Z
topic breast cancer prediction
feature selection correlation
machine learning methods
hybrid smote-tomek link