Predicting severely imbalanced data disk drive failures with machine learning models
Datasets related to hard drive failure, particularly BackBlaze Hard Drive Data, have been widely studied in the literature using many statistical, machine learning, and deep learning techniques. These datasets are severely imbalanced due to the presence of a small number of failed drives compared to...
Main Authors: | Jishan Ahmed, Robert C. Green II |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2022-09-01 |
Series: | Machine Learning with Applications |
Subjects: | Machine learning; Cost-sensitive learning; Class imbalance; Predictive maintenance (PdM) |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2666827022000585 |
_version_ | 1798032519320305664 |
---|---|
author | Jishan Ahmed; Robert C. Green II |
author_facet | Jishan Ahmed; Robert C. Green II |
author_sort | Jishan Ahmed |
collection | DOAJ |
description | Datasets related to hard drive failure, particularly BackBlaze Hard Drive Data, have been widely studied in the literature using many statistical, machine learning, and deep learning techniques. These datasets are severely imbalanced due to the presence of a small number of failed drives compared to the large number of healthy drives in operational data centers. It is challenging to mitigate the adverse consequences of class imbalance because learning is biased towards the majority class. SMART (Self-Monitoring, Analysis and Reporting Technology) attributes of the disk drives have been used in the past to design standard classification or regression algorithms. Although a few machine learning (ML) models, for instance, tree-based methods and ensemble learning algorithms, have addressed failure prediction, the effects of class imbalance have rarely been properly considered under the ML framework. This study, based on a review of the state of the art in the area, evaluates current methodologies to identify areas that were either overlooked or lacking, proposes methods for remediating these issues, and performs baseline experiments to demonstrate the proposed methodologies, including data sampling techniques and cost-sensitive learning. |
first_indexed | 2024-04-11T20:15:18Z |
format | Article |
id | doaj.art-e94a1683a8c64d89b465f713674671a7 |
institution | Directory Open Access Journal |
issn | 2666-8270 |
language | English |
last_indexed | 2024-04-11T20:15:18Z |
publishDate | 2022-09-01 |
publisher | Elsevier |
record_format | Article |
series | Machine Learning with Applications |
spelling | doaj.art-e94a1683a8c64d89b465f713674671a7; 2022-12-22T04:05:00Z; eng; Elsevier; Machine Learning with Applications; 2666-8270; 2022-09-01; 9; 100361; Predicting severely imbalanced data disk drive failures with machine learning models; Jishan Ahmed (0); Robert C. Green II (1); (0) Corresponding author; Department of Computer Science, Bowling Green State University, Bowling Green, 43403, OH, USA; (1) Department of Computer Science, Bowling Green State University, Bowling Green, 43403, OH, USA; Datasets related to hard drive failure, particularly BackBlaze Hard Drive Data, have been widely studied in the literature using many statistical, machine learning, and deep learning techniques. These datasets are severely imbalanced due to the presence of a small number of failed drives compared to the large number of healthy drives in operational data centers. It is challenging to mitigate the adverse consequences of class imbalance because learning is biased towards the majority class. SMART (Self-Monitoring, Analysis and Reporting Technology) attributes of the disk drives have been used in the past to design standard classification or regression algorithms. Although a few machine learning (ML) models, for instance, tree-based methods and ensemble learning algorithms, have addressed failure prediction, the effects of class imbalance have rarely been properly considered under the ML framework. This study, based on a review of the state of the art in the area, evaluates current methodologies to identify areas that were either overlooked or lacking, proposes methods for remediating these issues, and performs baseline experiments to demonstrate the proposed methodologies, including data sampling techniques and cost-sensitive learning.; http://www.sciencedirect.com/science/article/pii/S2666827022000585; Machine learning; Cost-sensitive learning; Class imbalance; Predictive maintenance (PdM) |
spellingShingle | Jishan Ahmed; Robert C. Green II; Predicting severely imbalanced data disk drive failures with machine learning models; Machine Learning with Applications; Machine learning; Cost-sensitive learning; Class imbalance; Predictive maintenance (PdM) |
title | Predicting severely imbalanced data disk drive failures with machine learning models |
topic | Machine learning; Cost-sensitive learning; Class imbalance; Predictive maintenance (PdM) |
url | http://www.sciencedirect.com/science/article/pii/S2666827022000585 |
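
The abstract in this record names data sampling and cost-sensitive learning as remedies for class imbalance but, being a catalog entry, gives no detail. The following is a minimal sketch of the two general ideas, not the authors' implementation: inverse-frequency class weights (one common form of cost-sensitive learning) and random oversampling of the minority class (one common data sampling technique). The function names, the toy labels, and the 995-to-5 class ratio are all assumptions made for illustration and do not come from the article or from the BackBlaze data.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Rare classes get large weights, so misclassifying them costs more
    when the weights are passed to a learner that supports them.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # Candidate rows of this class to draw duplicates from.
        pool = [row for row, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical toy data: 995 healthy records (label 0) and 5 failures (label 1).
labels = [0] * 995 + [1] * 5
rows = [[float(i)] for i in range(len(labels))]

weights = balanced_class_weights(labels)
resampled_rows, resampled_labels = random_oversample(rows, labels)

print(weights)
print(Counter(resampled_labels))
```

In practice, resampling of this kind is usually applied to the training split only, so that duplicated failure records do not leak into the evaluation data.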