Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure
Introduction Bloom Filters (BFs) are a scalable solution for probabilistic privacy-preserving record linkage but BFs can be compromised. Yao’s garbled circuits (GCs) can perform secure multi-party computation to compute the similarity of two BFs without a trusted third party. The major drawback of u...
Main Authors: | , , , , |
---|---|
Format: | Article |
Language: | English |
Published: |
Swansea University
2018-08-01
|
Series: | International Journal of Population Data Science |
Online Access: | https://ijpds.org/article/view/638 |
_version_ | 1797430516007305216 |
---|---|
author | Toan Ong Ibrahim Lazrig Indrajit Ray Indrakshi Ray Michael Kahn |
author_facet | Toan Ong Ibrahim Lazrig Indrajit Ray Indrakshi Ray Michael Kahn |
author_sort | Toan Ong |
collection | DOAJ |
description | Introduction
Bloom Filters (BFs) are a scalable solution for probabilistic privacy-preserving record linkage but BFs can be compromised. Yao’s garbled circuits (GCs) can perform secure multi-party computation to compute the similarity of two BFs without a trusted third party. The major drawback of using BFs and GCs together is poor efficiency.
Objectives and Approach
We evaluated the feasibility of BFs+GCs using high capacity compute engines and implementing a novel parallel processing framework in Google Cloud Compute Engines (GCCE). In the Yao’s two-party secure computation protocol, one party serves as the generator and the other party serves as the evaluator. To link data in parallel, records from both parties are divided into chunks. Linkage between every two chunks in the same block is processed by a thread. The number of threads for linkage depends on available computing resources. We tested the parallelized process in various scenarios with variations in hardware and software configurations.
Results
Two synthetic datasets with 10K records were linked using BFs+GCs on 12 different software and hardware configurations which varied by: number of CPU cores (4 to 32), memory size (15GB – 28.8GB), number of threads (6-41), and chunk size (50-200 records). The minimum configuration (4 cores; 15GB memory) took 8,062.4s to complete whereas the maximum configuration (32 cores; 28.8GB memory) took 1,454.1s. Increasing the number of threads or changing the chunk size without providing more CPU cores and memory did not improve the efficiency. Efficiency is improved on average by 39.81% when the number of cores and memory on the both sides are doubled. The CPU utilization is maximized (near 100% on both sides) when the computing power of the generator is double the evaluator.
Conclusion/Implications
The PPRL runtime of BFs+GCs was greatly improved using parallel processing in a cloud-based infrastructure. A cluster of GCCEs could be leveraged to reduce the runtime of data linkage operations even further. Scalable cloud-based infrastructures can overcome the trade-off between security and efficiency, allowing computationally complex methods to be implemented. |
first_indexed | 2024-03-09T09:28:38Z |
format | Article |
id | doaj.art-20c57fa7820f46d69fbcb582d40bfcbe |
institution | Directory Open Access Journal |
issn | 2399-4908 |
language | English |
last_indexed | 2024-03-09T09:28:38Z |
publishDate | 2018-08-01 |
publisher | Swansea University |
record_format | Article |
series | International Journal of Population Data Science |
spelling | doaj.art-20c57fa7820f46d69fbcb582d40bfcbe2023-12-02T04:53:18ZengSwansea UniversityInternational Journal of Population Data Science2399-49082018-08-013410.23889/ijpds.v3i4.638Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based InfrastructureToan Ong0Ibrahim Lazrig1Indrajit Ray2Indrakshi Ray3Michael Kahn4University of Colorado Anschutz Medical CampusColorado State UniversityColorado State UniversityColorado State UniversityUniversity of Colorado Anschutz Medical CampusIntroduction Bloom Filters (BFs) are a scalable solution for probabilistic privacy-preserving record linkage but BFs can be compromised. Yao’s garbled circuits (GCs) can perform secure multi-party computation to compute the similarity of two BFs without a trusted third party. The major drawback of using BFs and GCs together is poor efficiency. Objectives and Approach We evaluated the feasibility of BFs+GCs using high capacity compute engines and implementing a novel parallel processing framework in Google Cloud Compute Engines (GCCE). In the Yao’s two-party secure computation protocol, one party serves as the generator and the other party serves as the evaluator. To link data in parallel, records from both parties are divided into chunks. Linkage between every two chunks in the same block is processed by a thread. The number of threads for linkage depends on available computing resources. We tested the parallelized process in various scenarios with variations in hardware and software configurations. Results Two synthetic datasets with 10K records were linked using BFs+GCs on 12 different software and hardware configurations which varied by: number of CPU cores (4 to 32), memory size (15GB – 28.8GB), number of threads (6-41), and chunk size (50-200 records). The minimum configuration (4 cores; 15GB memory) took 8,062.4s to complete whereas the maximum configuration (32 cores; 28.8GB memory) took 1,454.1s. Increasing the number of threads or changing the chunk size without providing more CPU cores and memory did not improve the efficiency. Efficiency is improved on average by 39.81% when the number of cores and memory on the both sides are doubled. The CPU utilization is maximized (near 100% on both sides) when the computing power of the generator is double the evaluator. Conclusion/Implications The PPRL runtime of BFs+GCs was greatly improved using parallel processing in a cloud-based infrastructure. A cluster of GCCEs could be leveraged to reduce the runtime of data linkage operations even further. Scalable cloud-based infrastructures can overcome the trade-off between security and efficiency, allowing computationally complex methods to be implemented.https://ijpds.org/article/view/638 |
spellingShingle | Toan Ong Ibrahim Lazrig Indrajit Ray Indrakshi Ray Michael Kahn Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure International Journal of Population Data Science |
title | Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure |
title_full | Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure |
title_fullStr | Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure |
title_full_unstemmed | Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure |
title_short | Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure |
title_sort | scalable secure privacy preserving record linkage pprl methods using cloud based infrastructure |
url | https://ijpds.org/article/view/638 |
work_keys_str_mv | AT toanong scalablesecureprivacypreservingrecordlinkagepprlmethodsusingcloudbasedinfrastructure AT ibrahimlazrig scalablesecureprivacypreservingrecordlinkagepprlmethodsusingcloudbasedinfrastructure AT indrajitray scalablesecureprivacypreservingrecordlinkagepprlmethodsusingcloudbasedinfrastructure AT indrakshiray scalablesecureprivacypreservingrecordlinkagepprlmethodsusingcloudbasedinfrastructure AT michaelkahn scalablesecureprivacypreservingrecordlinkagepprlmethodsusingcloudbasedinfrastructure |