Evaluating Cluster-Based Synthetic Data Generation for Blood-Transfusion Analysis

Synthetic data generation is becoming an increasingly popular approach to making privacy-sensitive data available for analysis. Recently, cluster-based synthetic data generation (CBSDG) has been proposed, which uses explainable and tractable techniques for privacy preservation. Although the algorith...

Bibliographic Details
Main Authors: Shannon K. S. Kroes, Matthijs van Leeuwen, Rolf H. H. Groenwold, Mart P. Janssen
Format: Article
Language: English
Published: MDPI AG 2023-12-01
Series: Journal of Cybersecurity and Privacy
Subjects: synthetic data generation; privacy; blood transfusion; donor Hb deferral prediction
Online Access: https://www.mdpi.com/2624-800X/3/4/40
author Shannon K. S. Kroes
Matthijs van Leeuwen
Rolf H. H. Groenwold
Mart P. Janssen
collection DOAJ
description Synthetic data generation is becoming an increasingly popular approach to making privacy-sensitive data available for analysis. Recently, cluster-based synthetic data generation (CBSDG) has been proposed, which uses explainable and tractable techniques for privacy preservation. Although the algorithm demonstrated promising performance on simulated data, CBSDG has not yet been applied to real, personal data. In this work, a published blood-transfusion analysis is replicated with synthetic data to assess whether CBSDG can reproduce more complex and intricate variable relations than previously evaluated. Data from the Dutch national blood bank, consisting of 250,729 donation records, were used to predict donor hemoglobin (Hb) levels by means of support vector machines (SVMs). Precision scores were equal to the original data results for both male (0.997) and female (0.987) donors; recall was 0.007 higher for male and 0.003 lower for female donors (original estimates 0.739 and 0.637, respectively). The impact of the variables on Hb predictions was similar, as quantified and visualized with Shapley additive explanation values. Opportunities for attribute disclosure were decreased for all but two variables; only the binary variables Deferral Status and Sex could still be inferred. Such inference was also possible for donors who were not used as input for the generator and may result from correlations in the data as opposed to overfitting in the synthetic-data-generation process. The high predictive performance obtained with the synthetic data shows the potential of CBSDG for practical implementation.
first_indexed 2024-03-08T20:38:46Z
format Article
id doaj.art-51eae3367d2a42d186b5d4b278168e18
institution Directory of Open Access Journals
issn 2624-800X
language English
last_indexed 2024-03-08T20:38:46Z
publishDate 2023-12-01
publisher MDPI AG
record_format Article
series Journal of Cybersecurity and Privacy
spelling doaj.art-51eae3367d2a42d186b5d4b278168e18
2023-12-22T14:17:46Z
eng
MDPI AG
Journal of Cybersecurity and Privacy
2624-800X
2023-12-01
Volume 3, Issue 4, Pages 882-894
10.3390/jcp3040040
Evaluating Cluster-Based Synthetic Data Generation for Blood-Transfusion Analysis
Shannon K. S. Kroes (0): Netherlands Organisation for Applied Scientific Research (TNO), Anna van Buerenplein 1, 2595 DA The Hague, The Netherlands
Matthijs van Leeuwen (1): Leiden Institute of Advanced Computer Science, Leiden University, 2333 CA Leiden, The Netherlands
Rolf H. H. Groenwold (2): Department of Clinical Epidemiology, Leiden University Medical Center, 2333 ZA Leiden, The Netherlands
Mart P. Janssen (3): Transfusion Technology Assessment Group, Donor Medicine Research Department, Sanquin Research, 1066 CX Amsterdam, The Netherlands
https://www.mdpi.com/2624-800X/3/4/40
title Evaluating Cluster-Based Synthetic Data Generation for Blood-Transfusion Analysis
topic synthetic data generation
privacy
blood transfusion
donor Hb deferral prediction
url https://www.mdpi.com/2624-800X/3/4/40