Predicting severely imbalanced data disk drive failures with machine learning models

Datasets related to hard drive failure, particularly BackBlaze Hard Drive Data, have been widely studied in the literature using many statistical, machine learning, and deep learning techniques. These datasets are severely imbalanced due to the presence of a small number of failed drives compared to...

Full description

Bibliographic Details
Main Authors: Jishan Ahmed, Robert C. Green II
Format: Article
Language:English
Published: Elsevier 2022-09-01
Series:Machine Learning with Applications
Subjects:
Online Access:http://www.sciencedirect.com/science/article/pii/S2666827022000585
Description
Summary:Datasets related to hard drive failure, particularly BackBlaze Hard Drive Data, have been widely studied in the literature using many statistical, machine learning, and deep learning techniques. These datasets are severely imbalanced due to the presence of a small number of failed drives compared to huge amounts of healthy drives in the operational data centers. It is challenging to mitigate the adverse consequence of the class imbalance due to the presence of bias towards the majority class during learning. SMART (self monitoring analysis and reporting technology) attributes of the disk drives were utilized in the past to design standard classification or regression algorithms. Although few machine learning (ML) models, for instance, tree based methods and ensemble learning algorithms, addressed the failure prediction, the effects of class imbalance were rarely properly considered under the ML framework. This study, based on a review of the state-of-the-art in the area, evaluates current methodologies to identify areas that were either overlooked or lacking, proposes methods for remediating these issues, and performs some baseline experiments to demonstrate the proposed methodologies including data sampling techniques and cost-sensitive learning.
ISSN:2666-8270