Utilizing data sampling techniques on algorithmic fairness for customer churn prediction with data imbalance problems [version 2; peer review: 1 approved, 2 approved with reservations]

Background: Customer churn prediction (CCP) refers to detecting which customers are likely to cancel the services provided by a service provider, for example, internet services. The class imbalance problem (CIP) in machine learning occurs when there is a huge difference in the samples of the positiv...

Bibliographic Details
Main Authors:	Maw Maw, Su-Cheng Haw, Chin-Kuan Ho
Format:	Article
Language:	English
Published:	F1000 Research Ltd 2022-06-01
Series:	F1000Research
Subjects:	Customer churn prediction; Data sampling techniques; Algorithmic fairness; Class imbalance problem
Online Access:	https://f1000research.com/articles/10-988/v2

_version_	1817980796469772288
author	Maw Maw, Su-Cheng Haw, Chin-Kuan Ho
collection	DOAJ
description	Background: Customer churn prediction (CCP) refers to detecting which customers are likely to cancel the services provided by a service provider, for example, internet services. The class imbalance problem (CIP) in machine learning occurs when there is a large difference between the number of samples in the positive class and in the negative class. It is one of the major obstacles in CCP because it degrades classification performance. Utilizing data sampling techniques (DSTs) helps to resolve the CIP to some extent. Methods: In this paper, we examine the effect of using DSTs on algorithmic fairness, i.e., we investigate whether the results discriminate between male and female groups, and we compare the results before and after using DSTs. Three real-world datasets with unequal balancing rates were prepared and four ubiquitous DSTs were applied to them. Six popular classification techniques were utilized in the classification process. Both classifier performance and algorithmic fairness were evaluated with notable metrics. Results: The results indicated that the Random Forest classifier outperforms the other classifiers on all three datasets, and that using the SMOTE and ADASYN techniques causes more discrimination in the female group. The rate of unintentional discrimination seems to be higher in the original data of extremely unbalanced datasets under the following classifiers: Logistic Regression, LightGBM, and XGBoost. Conclusions: Algorithmic fairness has become a broadly studied area in recent years, yet there has been very little systematic study of the effect of using DSTs on algorithmic fairness. This study presents important findings to further the use of algorithmic fairness in CCP research.
first_indexed	2024-04-13T22:57:48Z
format	Article
id	doaj.art-949971e5dc10471a88e59c36c794e40b
institution	Directory of Open Access Journals
issn	2046-1402
language	English
last_indexed	2024-04-13T22:57:48Z
publishDate	2022-06-01
publisher	F1000 Research Ltd
record_format	Article
series	F1000Research
spelling	doaj.art-949971e5dc10471a88e59c36c794e40b; 2022-12-22T02:25:57Z; eng; F1000 Research Ltd; F1000Research; 2046-1402; 2022-06-01; 10134711; Maw Maw 0; Su-Cheng Haw 1; https://orcid.org/0000-0002-7190-0837; Chin-Kuan Ho 2; Faculty of Computing and Informatics, Multimedia University, Cyberjaya, Selangor, 63100, Malaysia (listed three times)
title	Utilizing data sampling techniques on algorithmic fairness for customer churn prediction with data imbalance problems [version 2; peer review: 1 approved, 2 approved with reservations]
topic	Customer churn prediction; Data sampling techniques; Algorithmic fairness; Class imbalance problem
url	https://f1000research.com/articles/10-988/v2

Similar Items