A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System

Bibliographic Details
Main Authors: Longxiang Wang, Xiaoshe Dong, Xingjun Zhang, Fuliang Guo, Yinfeng Wang, Weifeng Gong
Format: Article
Language: English
Published: MDPI AG 2016-07-01
Series: Symmetry
Subjects: storage system; deduplication; duplication elimination ratio; content defined chunking
Online Access: http://www.mdpi.com/2073-8994/8/7/69
author Longxiang Wang
Xiaoshe Dong
Xingjun Zhang
Fuliang Guo
Yinfeng Wang
Weifeng Gong
collection DOAJ
description Deduplication is an efficient data reduction technique used to mitigate the problem of huge data volumes in big data storage systems. Content defined chunking (CDC) is the most widely used algorithm in deduplication systems. The expected chunk size is an important parameter of CDC, and it significantly influences the duplicate elimination ratio (DER). We collected two realistic datasets to perform an experiment. The experimental results showed that the current practice of empirically setting the expected chunk size to 4 KB or 8 KB cannot optimize DER. Therefore, we present a logistic-based mathematical model that reveals the hidden relationship between the expected chunk size and the DER. This model provides a theoretical basis for optimizing DER by setting the expected chunk size reasonably. We used the collected datasets to verify this model. The experimental results showed that the R² values, which describe the goodness of fit, are above 0.9, validating the correctness of this mathematical model. Based on the DER model, we discussed how to bring DER close to the optimum by setting the expected chunk size reasonably.
first_indexed 2024-04-11T13:21:26Z
format Article
id doaj.art-5fc2c979b237461eacd1bf72f68eac5a
institution Directory of Open Access Journals
issn 2073-8994
language English
last_indexed 2024-04-11T13:21:26Z
publishDate 2016-07-01
publisher MDPI AG
record_format Article
series Symmetry
spelling doaj.art-5fc2c979b237461eacd1bf72f68eac5a; 2022-12-22T04:22:12Z; eng; MDPI AG; Symmetry; ISSN 2073-8994; 2016-07-01; volume 8, issue 7, article 69; DOI 10.3390/sym8070069 (sym8070069)
Author affiliations:
0 Longxiang Wang: The School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
1 Xiaoshe Dong: The School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
2 Xingjun Zhang: The School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
3 Fuliang Guo: The School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, China
4 Yinfeng Wang: The Shenzhen Institute of Information Technology, Shenzhen, 518172, China
5 Weifeng Gong: State Key Laboratory of High-End Server & Storage Technology, Jinan 250101, China
title A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System
topic storage system
deduplication
duplication elimination ratio
content defined chunking
url http://www.mdpi.com/2073-8994/8/7/69
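
Illustrative note: the abstract's two central quantities, the expected chunk size of CDC and the duplicate elimination ratio (DER), can be made concrete with a small sketch. Everything below is an assumption for illustration and is not taken from the article: the function names cdc_chunks and duplicate_elimination_ratio are hypothetical, the chunker is a toy (it rehashes a fixed window at every byte instead of using a rolling hash), and DER is taken to be the commonly used ratio of bytes before deduplication to bytes after deduplication, with metadata overhead ignored. The article's logistic model itself is not reproduced, because this record does not state its equation.

```python
import hashlib
import random


def cdc_chunks(data: bytes, expected_size: int, window: int = 16) -> list:
    """Toy content-defined chunker (illustrative, not a production design).

    A boundary is declared after a byte when a hash of the preceding
    `window` bytes has its low bits all zero. With a power-of-two
    expected_size this happens with probability 1/expected_size per
    position, so chunks average roughly expected_size bytes.
    """
    if expected_size < 1 or expected_size & (expected_size - 1):
        raise ValueError("expected_size must be a power of two")
    mask = expected_size - 1
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if end - start < window:
            continue  # enforce a minimum chunk length of `window` bytes
        digest = hashlib.blake2b(data[end - window:end], digest_size=4).digest()
        if int.from_bytes(digest, "big") & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


def duplicate_elimination_ratio(data: bytes, expected_size: int) -> float:
    """Assumed DER: input bytes / bytes of unique chunks (metadata ignored)."""
    if not data:
        raise ValueError("data must be non-empty")
    unique = {hashlib.sha256(c).digest(): len(c)
              for c in cdc_chunks(data, expected_size)}
    return len(data) / sum(unique.values())


if __name__ == "__main__":
    rng = random.Random(0)
    block = bytes(rng.getrandbits(8) for _ in range(8192))
    data = block + block  # synthetic input: the same block stored twice
    for size in (64, 256, 1024):
        print(size, round(duplicate_elimination_ratio(data, size), 3))
```

Because the sketch ignores per-chunk metadata, it cannot by itself show where the optimum lies; it only shows how the measured DER depends on the chosen expected chunk size for a given input.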