In-Memory Data Anonymization Using Scalable and High Performance RDD Design

Recent studies in data anonymization techniques have primarily focused on MapReduce. However, these existing MapReduce-based approaches often suffer from performance overheads due to inappropriate data allocation, expensive disk I/O access and network transfer, and no support for i...

Bibliographic Details
Main Authors:	Sibghat Ullah Bazai, Julian Jang-Jaccard
Format:	Article
Language:	English
Published:	MDPI AG 2020-10-01
Series:	Electronics
Subjects:	high performance; data anonymization; scalability; spark; big data mining; privacy and utility
Online Access:	https://www.mdpi.com/2079-9292/9/10/1732

author	Sibghat Ullah Bazai; Julian Jang-Jaccard
collection	DOAJ
description	Recent studies in data anonymization techniques have primarily focused on MapReduce. However, these existing MapReduce-based approaches often suffer from performance overheads due to inappropriate data allocation, expensive disk I/O access and network transfer, and a lack of support for iterative tasks. We propose “SparkDA”, a novel anonymization technique designed to take full advantage of the Spark platform to generate privacy-preserving anonymized datasets efficiently. Our proposal offers better partition control, in-memory operation, and cache management for the iterative operations that are heavily utilised in data anonymization processing. Our proposal is based on Spark’s Resilient Distributed Dataset (RDD) and two critical RDD operations, FlatMapRDD and ReduceByKeyRDD. The experimental results demonstrate that our proposal outperforms the existing approaches in terms of performance and scalability while maintaining high data privacy and utility levels. This illustrates that our proposal can be used in a wider range of big data applications that demand privacy.
first_indexed	2024-03-10T15:28:51Z
format	Article
id	doaj.art-ac783ac775484d05a7591ca7f957d004
institution	Directory Open Access Journal
issn	2079-9292
language	English
last_indexed	2024-03-10T15:28:51Z
publishDate	2020-10-01
publisher	MDPI AG
record_format	Article
series	Electronics
spelling	doaj.art-ac783ac775484d05a7591ca7f957d004; 2023-11-20T17:48:15Z; eng; MDPI AG; Electronics; ISSN 2079-9292; 2020-10-01; vol. 9, no. 10, art. 1732; DOI 10.3390/electronics9101732; Sibghat Ullah Bazai (Cybersecurity Lab, Computer Science/Information Technology, Massey University, Auckland 0632, New Zealand); Julian Jang-Jaccard (Cybersecurity Lab, Computer Science/Information Technology, Massey University, Auckland 0632, New Zealand)
title	In-Memory Data Anonymization Using Scalable and High Performance RDD Design
topic	high performance; data anonymization; scalability; spark; big data mining; privacy and utility
url	https://www.mdpi.com/2079-9292/9/10/1732

Similar Items