ACPR: Adaptive Classification Predictive Repair Method for Different Fault Scenarios

Erasure codes are widely used in large-scale distributed storage systems due to their high efficiency and reliability, but they also face extremely high repair penalties when data corruption occurs. At present, machine learning methods can accurately predict the next failure time and type of machine...


Bibliographic Details
Main Authors: Ying Song, Peisen Zheng, Yingai Tian, Bo Wang
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects: Distributed storage system, data recovery, erasure coding
Online Access: https://ieeexplore.ieee.org/document/10373784/
_version_ 1797357087484805120
author Ying Song
Peisen Zheng
Yingai Tian
Bo Wang
author_facet Ying Song
Peisen Zheng
Yingai Tian
Bo Wang
author_sort Ying Song
collection DOAJ
description Erasure codes are widely used in large-scale distributed storage systems because of their high efficiency and reliability, but they incur extremely high repair penalties when data corruption occurs. At present, machine learning methods can accurately predict the next failure time and failure type of machine nodes. Building on this, we propose an Adaptive Classification Predictive Repair method (ACPR) for different fault scenarios. It targets two problems in existing repair methods: unnecessary repair traffic caused by temporary failures, and the additional degraded reads of frequently accessed data that result when such data stays failed for longer. ACPR classifies failed blocks as high-risk or low-risk based on the failure type of the soon-to-fail (STF) node and the access heat of the STF blocks, and then performs adaptive predictive repair. It repairs high-risk blocks quickly to ensure data availability and delays the repair of low-risk blocks, which avoids a large amount of unnecessary repair traffic caused by temporary node failures in the cluster. Experiments on Alibaba Cloud Elastic Compute Service (ECS) show that, compared with FastPR and ECPipe, ACPR shortens the repair time per data block by up to 15.2% and 33.5%, respectively, and reduces repair traffic by up to 74.1% and 84.4%, respectively.
first_indexed 2024-03-08T14:39:40Z
format Article
id doaj.art-6f8b10857fd44aa58d1a4acc341099eb
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-08T14:39:40Z
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-6f8b10857fd44aa58d1a4acc341099eb
2024-01-12T00:01:47Z
eng
IEEE
IEEE Access
2169-3536
2024-01-01
12
4631
4641
10.1109/ACCESS.2023.3346881
10373784
ACPR: Adaptive Classification Predictive Repair Method for Different Fault Scenarios
Ying Song (https://orcid.org/0000-0001-6257-1747), Beijing Information Science and Technology University, Beijing, China
Peisen Zheng (https://orcid.org/0009-0000-5790-4316), Beijing Information Science and Technology University, Beijing, China
Yingai Tian, Beijing Information Science and Technology University, Beijing, China
Bo Wang (https://orcid.org/0000-0003-3598-5359), Software Engineering College, Zhengzhou University of Light Industry (ZZULI), Zhengzhou, China
https://ieeexplore.ieee.org/document/10373784/
Distributed storage system
data recovery
erasure coding
title ACPR: Adaptive Classification Predictive Repair Method for Different Fault Scenarios
topic Distributed storage system
data recovery
erasure coding
url https://ieeexplore.ieee.org/document/10373784/