A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System
Deduplication is an efficient data reduction technique, and it is used to mitigate the problem of huge data volume in big data storage systems. Content defined chunking (CDC) is the most widely used algorithm in deduplication systems. The expected chunk size is an important parameter of CDC, and it...
Main Authors: | Longxiang Wang, Xiaoshe Dong, Xingjun Zhang, Fuliang Guo, Yinfeng Wang, Weifeng Gong |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2016-07-01 |
Series: | Symmetry |
Subjects: | storage system; deduplication; duplication elimination ratio; content defined chunking |
Online Access: | http://www.mdpi.com/2073-8994/8/7/69 |
_version_ | 1811184903886209024 |
---|---|
author | Longxiang Wang; Xiaoshe Dong; Xingjun Zhang; Fuliang Guo; Yinfeng Wang; Weifeng Gong |
author_facet | Longxiang Wang; Xiaoshe Dong; Xingjun Zhang; Fuliang Guo; Yinfeng Wang; Weifeng Gong |
author_sort | Longxiang Wang |
collection | DOAJ |
description | Deduplication is an efficient data reduction technique used to mitigate the problem of huge data volume in big data storage systems. Content defined chunking (CDC) is the most widely used algorithm in deduplication systems. The expected chunk size is an important parameter of CDC, and it influences the duplicate elimination ratio (DER) significantly. We collected two realistic datasets to perform an experiment. The experimental results showed that the current approach of empirically setting the expected chunk size to 4 KB or 8 KB cannot optimize DER. Therefore, we present a logistic based mathematical model to reveal the hidden relationship between the expected chunk size and the DER. This model provides a theoretical basis for optimizing DER by setting the expected chunk size reasonably. We used the collected datasets to verify this model. The experimental results showed that the R² values, which describe the goodness of fit, are above 0.9, validating the correctness of this mathematical model. Based on the DER model, we discussed how to make DER close to the optimum by setting the expected chunk size reasonably. |
first_indexed | 2024-04-11T13:21:26Z |
format | Article |
id | doaj.art-5fc2c979b237461eacd1bf72f68eac5a |
institution | Directory Open Access Journal |
issn | 2073-8994 |
language | English |
last_indexed | 2024-04-11T13:21:26Z |
publishDate | 2016-07-01 |
publisher | MDPI AG |
record_format | Article |
series | Symmetry |
spelling | doaj.art-5fc2c979b237461eacd1bf72f68eac5a; 2022-12-22T04:22:12Z; eng; MDPI AG; Symmetry; 2073-8994; 2016-07-01; 8; 7; 69; 10.3390/sym8070069; sym8070069; A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System; Longxiang Wang (0); Xiaoshe Dong (1); Xingjun Zhang (2); Fuliang Guo (3); Yinfeng Wang (4); Weifeng Gong (5); The School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China; The School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China; The School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China; The School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China; The Shenzhen Institute of Information Technology, Shenzhen, 518172, China; State Key Laboratory of High-End Server & Storage Technology, Jinan 250101, China; Deduplication is an efficient data reduction technique used to mitigate the problem of huge data volume in big data storage systems. Content defined chunking (CDC) is the most widely used algorithm in deduplication systems. The expected chunk size is an important parameter of CDC, and it influences the duplicate elimination ratio (DER) significantly. We collected two realistic datasets to perform an experiment. The experimental results showed that the current approach of empirically setting the expected chunk size to 4 KB or 8 KB cannot optimize DER. Therefore, we present a logistic based mathematical model to reveal the hidden relationship between the expected chunk size and the DER. This model provides a theoretical basis for optimizing DER by setting the expected chunk size reasonably. We used the collected datasets to verify this model. The experimental results showed that the R² values, which describe the goodness of fit, are above 0.9, validating the correctness of this mathematical model. Based on the DER model, we discussed how to make DER close to the optimum by setting the expected chunk size reasonably.; http://www.mdpi.com/2073-8994/8/7/69; storage system; deduplication; duplication elimination ratio; content defined chunking |
spellingShingle | Longxiang Wang; Xiaoshe Dong; Xingjun Zhang; Fuliang Guo; Yinfeng Wang; Weifeng Gong; A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System; Symmetry; storage system; deduplication; duplication elimination ratio; content defined chunking |
title | A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System |
title_full | A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System |
title_fullStr | A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System |
title_full_unstemmed | A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System |
title_short | A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System |
title_sort | logistic based mathematical model to optimize duplicate elimination ratio in content defined chunking based big data storage system |
topic | storage system; deduplication; duplication elimination ratio; content defined chunking |
url | http://www.mdpi.com/2073-8994/8/7/69 |
work_keys_str_mv | AT longxiangwang alogisticbasedmathematicalmodeltooptimizeduplicateeliminationratioincontentdefinedchunkingbasedbigdatastoragesystem AT xiaoshedong alogisticbasedmathematicalmodeltooptimizeduplicateeliminationratioincontentdefinedchunkingbasedbigdatastoragesystem AT xingjunzhang alogisticbasedmathematicalmodeltooptimizeduplicateeliminationratioincontentdefinedchunkingbasedbigdatastoragesystem AT fuliangguo alogisticbasedmathematicalmodeltooptimizeduplicateeliminationratioincontentdefinedchunkingbasedbigdatastoragesystem AT yinfengwang alogisticbasedmathematicalmodeltooptimizeduplicateeliminationratioincontentdefinedchunkingbasedbigdatastoragesystem AT weifenggong alogisticbasedmathematicalmodeltooptimizeduplicateeliminationratioincontentdefinedchunkingbasedbigdatastoragesystem AT longxiangwang logisticbasedmathematicalmodeltooptimizeduplicateeliminationratioincontentdefinedchunkingbasedbigdatastoragesystem AT xiaoshedong logisticbasedmathematicalmodeltooptimizeduplicateeliminationratioincontentdefinedchunkingbasedbigdatastoragesystem AT xingjunzhang logisticbasedmathematicalmodeltooptimizeduplicateeliminationratioincontentdefinedchunkingbasedbigdatastoragesystem AT fuliangguo logisticbasedmathematicalmodeltooptimizeduplicateeliminationratioincontentdefinedchunkingbasedbigdatastoragesystem AT yinfengwang logisticbasedmathematicalmodeltooptimizeduplicateeliminationratioincontentdefinedchunkingbasedbigdatastoragesystem AT weifenggong logisticbasedmathematicalmodeltooptimizeduplicateeliminationratioincontentdefinedchunkingbasedbigdatastoragesystem |
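
Illustrative note: the description above says that the expected chunk size of content defined chunking (CDC) strongly influences the duplicate elimination ratio (DER). The Python sketch below shows one way that dependence can be observed on synthetic data. It is not the authors' implementation or their logistic model. The gear-style rolling hash, the names `GEAR`, `cdc_chunks` and `der`, the 8x maximum chunk size, and the definition of DER as input bytes divided by stored bytes (ignoring metadata overhead) are all assumptions made for illustration.

```python
import hashlib
import random

# Hypothetical table of 256 pseudo-random 32-bit values for a gear-style hash.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data: bytes, expected_size: int) -> list:
    """Split data at content-defined boundaries.

    expected_size is assumed to be a power of two (2**bits, bits <= 31).
    A boundary is declared when the top `bits` bits of the rolling hash are
    zero, which happens with probability about 1/expected_size per byte.
    """
    bits = expected_size.bit_length() - 1
    shift = 32 - bits
    max_size = 8 * expected_size  # assumed cap on chunk length
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Older bytes are shifted out of the 32-bit state, so the hash
        # depends only on the most recent bytes of the current chunk.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        if (h >> shift) == 0 or (i - start + 1) >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def der(streams, expected_size: int) -> float:
    """Assumed DER: total input bytes / bytes stored after deduplication."""
    seen = set()
    total = stored = 0
    for data in streams:
        for chunk in cdc_chunks(data, expected_size):
            total += len(chunk)
            digest = hashlib.sha1(chunk).digest()
            if digest not in seen:  # only previously unseen chunks are stored
                seen.add(digest)
                stored += len(chunk)
    return total / stored


def make_streams(n: int = 64 * 1024, seed: int = 1):
    """Synthetic example: a random file and a copy with a 3-byte insertion."""
    rng = random.Random(seed)
    base = bytes(rng.getrandbits(8) for _ in range(n))
    edited = base[: n // 2] + b"xyz" + base[n // 2:]
    return base, edited


if __name__ == "__main__":
    base, edited = make_streams()
    for size in (256, 512, 1024, 2048, 4096, 8192):
        print(size, round(der([base, edited], size), 3))
```

On this synthetic input the measured ratio changes with the expected chunk size because a single small edit invalidates a larger share of the data when chunks are larger. Real datasets also pay a per-chunk metadata cost that this sketch leaves out.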