BUILDING CLASSIFICATION MODELS FROM IMBALANCED FRAUD DETECTION DATA

Many real-world data sets exhibit imbalanced class distributions in which almost all instances are assigned to one class and far fewer instances to a smaller, yet usually interesting class. Building classification models from such imbalanced data sets is a relatively new challenge in the machine lea...

Bibliographic Details
Main Authors:	Terence Yong Koon Beh, Swee Chuan Tan, Hwee Theng Yeo
Format:	Article
Language:	English
Published:	UiTM Press 2014-10-01
Series:	Malaysian Journal of Computing
Subjects:	imbalanced data; machine learning; model evaluation; performance measures

collection	DOAJ
description	Many real-world data sets exhibit imbalanced class distributions in which almost all instances are assigned to one class and far fewer instances to a smaller, yet usually interesting class. Building classification models from such imbalanced data sets is a relatively new challenge in the machine learning and data mining community because many traditional classification algorithms assume similar proportions of majority and minority classes. When the data is imbalanced, these algorithms generate models that achieve good classification accuracy for the majority class, but poor accuracy for the minority class. This paper reports our experience in applying data balancing techniques to develop a classifier for an imbalanced real-world fraud detection data set. We evaluated the models generated from seven classification algorithms with two simple data balancing techniques. Despite the many ideas proposed in the literature to tackle the class imbalance issue, our study shows that the simplest data balancing technique is all that is required to significantly improve the accuracy in identifying the primary class of interest (i.e., the minority class) in all seven algorithms tested. Our results also show that precision and recall are useful and effective measures for evaluating models created from artificially balanced data. Hence, we advise data mining practitioners to try simple data balancing first before exploring more sophisticated techniques to tackle the class imbalance problem.
id	doaj.art-a49d4a7883574332b20af00688991995
institution	Directory Open Access Journal
issn	2600-8238
doi	https://doi.org/10.24191/mjoc.v2i2.0012
affiliation	School of Business, SIM University (all three authors)
numbering	221333
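
The description above recommends trying simple data balancing first and evaluating with precision and recall. The record does not name the two balancing techniques the authors used, so the following is only a minimal sketch that assumes random undersampling of the majority class as one plausible "simple" technique. The function names `undersample` and `precision_recall`, the toy labels, and the 5% fraud rate are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def undersample(rows, labels, seed=0):
    """Randomly drop rows until every class is as small as the rarest class.

    Assumption: this stands in for a 'simple data balancing technique';
    the source record does not say which techniques were actually used.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        # Sample without replacement so no row is duplicated.
        for row in rng.sample(group, n_min):
            balanced.append((row, label))
    rng.shuffle(balanced)
    return [r for r, _ in balanced], [l for _, l in balanced]


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class of interest (here, label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: 5 fraud cases (1) among 100 transactions.
labels = [1] * 5 + [0] * 95
rows = list(range(100))

bal_rows, bal_labels = undersample(rows, labels)
balanced_counts = Counter(bal_labels)

# A model that always predicts the majority class is 95% accurate on this
# data, yet it never identifies a single fraud case.
always_legit = [0] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_legit)) / len(labels)
baseline_pr = precision_recall(labels, always_legit)
```

On this toy data the always-majority predictor scores high on accuracy while its precision and recall for the minority class are both zero, which illustrates why the description favours precision and recall over plain accuracy when the class of interest is rare.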
