Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure

Introduction Bloom Filters (BFs) are a scalable solution for probabilistic privacy-preserving record linkage but BFs can be compromised. Yao’s garbled circuits (GCs) can perform secure multi-party computation to compute the similarity of two BFs without a trusted third party. The major drawback of u...

Full description

Bibliographic Details
Main Authors: Toan Ong, Ibrahim Lazrig, Indrajit Ray, Indrakshi Ray, Michael Kahn
Format: Article
Language:English
Published: Swansea University 2018-08-01
Series:International Journal of Population Data Science
Online Access:https://ijpds.org/article/view/638
_version_ 1797430516007305216
author Toan Ong
Ibrahim Lazrig
Indrajit Ray
Indrakshi Ray
Michael Kahn
author_facet Toan Ong
Ibrahim Lazrig
Indrajit Ray
Indrakshi Ray
Michael Kahn
author_sort Toan Ong
collection DOAJ
description Introduction Bloom Filters (BFs) are a scalable solution for probabilistic privacy-preserving record linkage but BFs can be compromised. Yao’s garbled circuits (GCs) can perform secure multi-party computation to compute the similarity of two BFs without a trusted third party. The major drawback of using BFs and GCs together is poor efficiency. Objectives and Approach We evaluated the feasibility of BFs+GCs using high capacity compute engines and implementing a novel parallel processing framework in Google Cloud Compute Engines (GCCE). In the Yao’s two-party secure computation protocol, one party serves as the generator and the other party serves as the evaluator. To link data in parallel, records from both parties are divided into chunks. Linkage between every two chunks in the same block is processed by a thread. The number of threads for linkage depends on available computing resources. We tested the parallelized process in various scenarios with variations in hardware and software configurations. Results Two synthetic datasets with 10K records were linked using BFs+GCs on 12 different software and hardware configurations which varied by: number of CPU cores (4 to 32), memory size (15GB – 28.8GB), number of threads (6-41), and chunk size (50-200 records). The minimum configuration (4 cores; 15GB memory) took 8,062.4s to complete whereas the maximum configuration (32 cores; 28.8GB memory) took 1,454.1s. Increasing the number of threads or changing the chunk size without providing more CPU cores and memory did not improve the efficiency. Efficiency is improved on average by 39.81% when the number of cores and memory on the both sides are doubled. The CPU utilization is maximized (near 100% on both sides) when the computing power of the generator is double the evaluator. Conclusion/Implications The PPRL runtime of BFs+GCs was greatly improved using parallel processing in a cloud-based infrastructure. A cluster of GCCEs could be leveraged to reduce the runtime of data linkage operations even further. Scalable cloud-based infrastructures can overcome the trade-off between security and efficiency, allowing computationally complex methods to be implemented.
first_indexed 2024-03-09T09:28:38Z
format Article
id doaj.art-20c57fa7820f46d69fbcb582d40bfcbe
institution Directory Open Access Journal
issn 2399-4908
language English
last_indexed 2024-03-09T09:28:38Z
publishDate 2018-08-01
publisher Swansea University
record_format Article
series International Journal of Population Data Science
spelling doaj.art-20c57fa7820f46d69fbcb582d40bfcbe2023-12-02T04:53:18ZengSwansea UniversityInternational Journal of Population Data Science2399-49082018-08-013410.23889/ijpds.v3i4.638Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based InfrastructureToan Ong0Ibrahim Lazrig1Indrajit Ray2Indrakshi Ray3Michael Kahn4University of Colorado Anschutz Medical CampusColorado State UniversityColorado State UniversityColorado State UniversityUniversity of Colorado Anschutz Medical CampusIntroduction Bloom Filters (BFs) are a scalable solution for probabilistic privacy-preserving record linkage but BFs can be compromised. Yao’s garbled circuits (GCs) can perform secure multi-party computation to compute the similarity of two BFs without a trusted third party. The major drawback of using BFs and GCs together is poor efficiency. Objectives and Approach We evaluated the feasibility of BFs+GCs using high capacity compute engines and implementing a novel parallel processing framework in Google Cloud Compute Engines (GCCE). In the Yao’s two-party secure computation protocol, one party serves as the generator and the other party serves as the evaluator. To link data in parallel, records from both parties are divided into chunks. Linkage between every two chunks in the same block is processed by a thread. The number of threads for linkage depends on available computing resources. We tested the parallelized process in various scenarios with variations in hardware and software configurations. Results Two synthetic datasets with 10K records were linked using BFs+GCs on 12 different software and hardware configurations which varied by: number of CPU cores (4 to 32), memory size (15GB – 28.8GB), number of threads (6-41), and chunk size (50-200 records). The minimum configuration (4 cores; 15GB memory) took 8,062.4s to complete whereas the maximum configuration (32 cores; 28.8GB memory) took 1,454.1s. Increasing the number of threads or changing the chunk size without providing more CPU cores and memory did not improve the efficiency. Efficiency is improved on average by 39.81% when the number of cores and memory on the both sides are doubled. The CPU utilization is maximized (near 100% on both sides) when the computing power of the generator is double the evaluator. Conclusion/Implications The PPRL runtime of BFs+GCs was greatly improved using parallel processing in a cloud-based infrastructure. A cluster of GCCEs could be leveraged to reduce the runtime of data linkage operations even further. Scalable cloud-based infrastructures can overcome the trade-off between security and efficiency, allowing computationally complex methods to be implemented.https://ijpds.org/article/view/638
spellingShingle Toan Ong
Ibrahim Lazrig
Indrajit Ray
Indrakshi Ray
Michael Kahn
Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure
International Journal of Population Data Science
title Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure
title_full Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure
title_fullStr Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure
title_full_unstemmed Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure
title_short Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure
title_sort scalable secure privacy preserving record linkage pprl methods using cloud based infrastructure
url https://ijpds.org/article/view/638
work_keys_str_mv AT toanong scalablesecureprivacypreservingrecordlinkagepprlmethodsusingcloudbasedinfrastructure
AT ibrahimlazrig scalablesecureprivacypreservingrecordlinkagepprlmethodsusingcloudbasedinfrastructure
AT indrajitray scalablesecureprivacypreservingrecordlinkagepprlmethodsusingcloudbasedinfrastructure
AT indrakshiray scalablesecureprivacypreservingrecordlinkagepprlmethodsusingcloudbasedinfrastructure
AT michaelkahn scalablesecureprivacypreservingrecordlinkagepprlmethodsusingcloudbasedinfrastructure