In-Memory Data Anonymization Using Scalable and High Performance RDD Design

Recent studies in data anonymization techniques have primarily focused on MapReduce. However, these existing MapReduce-based approaches often suffer from many performance overheads due to their inappropriate use of data allocation, expensive disk I/O access and network transfer, and no support for i...


Bibliographic Details
Main Authors: Sibghat Ullah Bazai, Julian Jang-Jaccard
Format: Article
Language: English
Published: MDPI AG 2020-10-01
Series: Electronics
Subjects: high performance; data anonymization; scalability; spark; big data mining; privacy and utility
Online Access: https://www.mdpi.com/2079-9292/9/10/1732
collection DOAJ
description Recent studies in data anonymization techniques have primarily focused on MapReduce. However, these existing MapReduce-based approaches often suffer from many performance overheads due to their inappropriate use of data allocation, expensive disk I/O access and network transfer, and lack of support for iterative tasks. We propose “SparkDA”, a novel anonymization technique designed to take full advantage of the Spark platform to generate privacy-preserving anonymized datasets as efficiently as possible. Our proposal offers better partition control, in-memory operation, and cache management for the iterative operations that are heavily utilised in data anonymization processing. Our proposal is based on Spark’s Resilient Distributed Dataset (RDD) and two critical RDD operations, FlatMapRDD and ReduceByKeyRDD. The experimental results demonstrate that our proposal outperforms the existing approaches in terms of performance and scalability while maintaining high data privacy and utility levels. This illustrates that our proposal can be used in a wider range of big data applications that demand privacy.
first_indexed 2024-03-10T15:28:51Z
id doaj.art-ac783ac775484d05a7591ca7f957d004
institution Directory Open Access Journal
issn 2079-9292
last_indexed 2024-03-10T15:28:51Z
spelling doaj.art-ac783ac775484d05a7591ca7f957d004; 2023-11-20T17:48:15Z; eng; MDPI AG; Electronics; ISSN 2079-9292; 2020-10-01; vol. 9, no. 10, art. 1732; doi:10.3390/electronics9101732
Sibghat Ullah Bazai (Cybersecurity Lab, Computer Science/Information Technology, Massey University, Auckland 0632, New Zealand)
Julian Jang-Jaccard (Cybersecurity Lab, Computer Science/Information Technology, Massey University, Auckland 0632, New Zealand)
title In-Memory Data Anonymization Using Scalable and High Performance RDD Design
topic high performance
data anonymization
scalability
spark
big data mining
privacy and utility
url https://www.mdpi.com/2079-9292/9/10/1732