Handling Imbalanced Data in Road Crash Severity Prediction by Machine Learning Algorithms

Crash severity is undoubtedly a fundamental aspect of a crash event. Although machine learning algorithms for predicting crash severity have recently gained interest from the academic community, there is a significant trend towards neglecting the fact that crash datasets are acutely imbalanced. Overlo...


Bibliographic Details
Main Authors:	Nicholas Fiorentini, Massimo Losa
Format:	Article
Language:	English
Published:	MDPI AG 2020-07-01
Series:	Infrastructures
Subjects:	crash severity; machine learning classification algorithms; random undersampling the majority class; random classification tree; k-nearest neighbor; random forest
Online Access:	https://www.mdpi.com/2412-3811/5/7/61

author	Nicholas Fiorentini Massimo Losa
collection	DOAJ
description	Crash severity is undoubtedly a fundamental aspect of a crash event. Although machine learning algorithms for predicting crash severity have recently gained interest from the academic community, there is a significant trend towards neglecting the fact that crash datasets are acutely imbalanced. Overlooking this fact generally leads to weak classifiers for predicting the minority class (crashes with higher severity). In this paper, to handle imbalanced accident datasets and provide a better prediction for the minority class, the random undersampling the majority class (RUMC) technique is used. By employing an imbalanced and a RUMC-based balanced training set, we propose the calibration, validation, and evaluation of four different crash severity predictive models, including random tree, k-nearest neighbor, logistic regression, and random forest. Accuracy, true positive rate (recall), false positive rate, true negative rate, precision, F1-score, and the confusion matrix have been calculated to assess the performance. Outcomes show that RUMC-based models make the classifiers more reliable at detecting fatal crashes and those causing injury. Indeed, in imbalanced models, the true positive rate for predicting fatal crashes and those causing injury spans from 0% (logistic regression) to 18.3% (k-nearest neighbor), while for the RUMC-based models, it spans from 52.5% (RUMC-based logistic regression) to 57.2% (RUMC-based k-nearest neighbor). Organizations and decision-makers could make use of RUMC and machine learning algorithms in predicting the severity of a crash occurrence, managing the present, and planning the future of their works.
first_indexed	2024-03-10T18:21:06Z
format	Article
id	doaj.art-b3da26cdab2944208f5a8313c9b6364f
institution	Directory of Open Access Journals
issn	2412-3811
language	English
last_indexed	2024-03-10T18:21:06Z
publishDate	2020-07-01
publisher	MDPI AG
record_format	Article
series	Infrastructures
doi	10.3390/infrastructures5070061
affiliation	Department of Civil and Industrial Engineering (DICI), Engineering School of the University of Pisa, Largo Lucio Lazzarino 1, 56126 Pisa, Italy (both authors)
title	Handling Imbalanced Data in Road Crash Severity Prediction by Machine Learning Algorithms
topic	crash severity; machine learning classification algorithms; random undersampling the majority class; random classification tree; k-nearest neighbor; random forest
url	https://www.mdpi.com/2412-3811/5/7/61
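The description in this record names random undersampling of the majority class (RUMC) and a set of confusion-matrix metrics, but gives no implementation. The following is a minimal Python sketch of both ideas, not the authors' code. It assumes a binary label (1 for fatal or injury crashes, 0 for all other crashes) and uses synthetic toy rows; the function names random_undersample and binary_metrics are hypothetical.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until every class
    has as many rows as the smallest class (illustrative sketch)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        # Sample without replacement so no row is duplicated.
        for row in rng.sample(members, n_min):
            balanced.append((row, label))
    rng.shuffle(balanced)
    return [r for r, _ in balanced], [l for _, l in balanced]


def binary_metrics(y_true, y_pred, positive=1):
    """Compute the metrics listed in the description from the
    four cells of a binary confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "tpr": recall,
        "fpr": ratio(fp, fp + tn),
        "tnr": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Synthetic toy data (NOT from the paper): 5 minority rows, 95 majority rows.
rows = [[i] for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]

bal_rows, bal_labels = random_undersample(rows, labels, seed=42)
print(Counter(bal_labels))

# A trivial classifier that always predicts the majority class.
m = binary_metrics(labels, [0] * len(labels))
print(m["accuracy"], m["tpr"])
```

On the toy data, a classifier that always predicts the majority class reaches 95% accuracy with a true positive rate of 0, which illustrates why accuracy alone says little about the minority class. Undersampling discards majority rows, so it is normally applied to the training set only.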


Similar Items