ACPR: Adaptive Classification Predictive Repair Method for Different Fault Scenarios
Erasure codes are widely used in large-scale distributed storage systems due to their high efficiency and reliability, but they also face extremely high repair penalties when data corruption occurs. At present, machine learning methods can accurately predict the next failure time and type of machine...
Main Authors: | Ying Song, Peisen Zheng, Yingai Tian, Bo Wang |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2024-01-01 |
Series: | IEEE Access |
Subjects: | Distributed storage system, data recovery, erasure coding |
Online Access: | https://ieeexplore.ieee.org/document/10373784/ |
_version_ | 1797357087484805120 |
---|---|
author | Ying Song, Peisen Zheng, Yingai Tian, Bo Wang |
author_facet | Ying Song, Peisen Zheng, Yingai Tian, Bo Wang |
author_sort | Ying Song |
collection | DOAJ |
description | Erasure codes are widely used in large-scale distributed storage systems because of their high efficiency and reliability, but they incur extremely high repair penalties when data corruption occurs. Machine learning methods can now accurately predict the time and type of a machine node's next failure. Building on this, we propose an Adaptive Classification Predictive Repair method (ACPR) for different fault scenarios. It targets two problems in existing repair methods: unnecessary repair traffic caused by temporary failures, and the additional degraded reads of frequently accessed data that result when such data stays failed for longer. ACPR classifies failed blocks as high-risk or low-risk based on the failure type of the soon-to-fail (STF) node and the access heat of the STF blocks, and performs adaptive predictive repair accordingly. By repairing high-risk blocks quickly to ensure data availability while delaying the repair of low-risk blocks, it avoids a large amount of unnecessary repair traffic caused by temporary node failures in the cluster. Experimental results on Alibaba Cloud Elastic Compute Service (ECS) show that, compared with FastPR and ECPipe, ACPR shortens the repair time per data block by up to 15.2% and 33.5%, respectively, and reduces repair traffic by up to 74.1% and 84.4%, respectively. |
first_indexed | 2024-03-08T14:39:40Z |
format | Article |
id | doaj.art-6f8b10857fd44aa58d1a4acc341099eb |
institution | Directory of Open Access Journals |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-08T14:39:40Z |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-6f8b10857fd44aa58d1a4acc341099eb; 2024-01-12T00:01:47Z; eng; IEEE; IEEE Access; 2169-3536; 2024-01-01; 12; 4631; 4641; 10.1109/ACCESS.2023.3346881; 10373784; ACPR: Adaptive Classification Predictive Repair Method for Different Fault Scenarios; Ying Song (https://orcid.org/0000-0001-6257-1747); Peisen Zheng (https://orcid.org/0009-0000-5790-4316); Yingai Tian; Bo Wang (https://orcid.org/0000-0003-3598-5359); Beijing Information Science and Technology University, Beijing, China; Beijing Information Science and Technology University, Beijing, China; Beijing Information Science and Technology University, Beijing, China; Software Engineering College, Zhengzhou University of Light Industry (ZZULI), Zhengzhou, China; https://ieeexplore.ieee.org/document/10373784/; Distributed storage system; data recovery; erasure coding |
title | ACPR: Adaptive Classification Predictive Repair Method for Different Fault Scenarios |
title_sort | acpr adaptive classification predictive repair method for different fault scenarios |
topic | Distributed storage system, data recovery, erasure coding |
url | https://ieeexplore.ieee.org/document/10373784/ |