Predicting severely imbalanced data disk drive failures with machine learning models

Datasets related to hard drive failure, particularly BackBlaze Hard Drive Data, have been widely studied in the literature using many statistical, machine learning, and deep learning techniques. These datasets are severely imbalanced due to the presence of a small number of failed drives compared to...

Bibliographic Details
Main Authors: Jishan Ahmed, Robert C. Green II
Format: Article
Language: English
Published: Elsevier 2022-09-01
Series: Machine Learning with Applications
Subjects: Machine learning; Cost-sensitive learning; Class imbalance; Predictive maintenance (PdM)
Online Access: http://www.sciencedirect.com/science/article/pii/S2666827022000585
collection DOAJ
description Datasets related to hard drive failure, particularly BackBlaze Hard Drive Data, have been widely studied in the literature using many statistical, machine learning, and deep learning techniques. These datasets are severely imbalanced because operational data centers contain a small number of failed drives compared to a very large number of healthy drives. Mitigating the adverse consequences of this class imbalance is challenging because learning is biased towards the majority class. SMART (Self-Monitoring, Analysis and Reporting Technology) attributes of the disk drives have been used in the past to design standard classification or regression algorithms. Although a few machine learning (ML) models, for instance tree-based methods and ensemble learning algorithms, have addressed failure prediction, the effects of class imbalance were rarely properly considered under the ML framework. This study, based on a review of the state of the art in the area, evaluates current methodologies to identify areas that were either overlooked or lacking, proposes methods for remediating these issues, and performs some baseline experiments to demonstrate the proposed methodologies, including data sampling techniques and cost-sensitive learning.
first_indexed 2024-04-11T20:15:18Z
format Article
id doaj.art-e94a1683a8c64d89b465f713674671a7
institution Directory Open Access Journal
issn 2666-8270
language English
last_indexed 2024-04-11T20:15:18Z
publishDate 2022-09-01
publisher Elsevier
record_format Article
series Machine Learning with Applications
spelling doaj.art-e94a1683a8c64d89b465f713674671a7; 2022-12-22T04:05:00Z; eng; Elsevier; Machine Learning with Applications; 2666-8270; 2022-09-01; 9; 100361; Predicting severely imbalanced data disk drive failures with machine learning models; Jishan Ahmed (corresponding author), Department of Computer Science, Bowling Green State University, Bowling Green, 43403, OH, USA; Robert C. Green II, Department of Computer Science, Bowling Green State University, Bowling Green, 43403, OH, USA; http://www.sciencedirect.com/science/article/pii/S2666827022000585; Machine learning; Cost-sensitive learning; Class imbalance; Predictive maintenance (PdM)
topic Machine learning
Cost-sensitive learning
Class imbalance
Predictive maintenance (PdM)
url http://www.sciencedirect.com/science/article/pii/S2666827022000585