Optimization of the Mainzelliste software for fast privacy-preserving record linkage

Abstract Background Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data su...

Bibliographic Details
Main Authors: Florens Rohde, Martin Franke, Ziad Sehili, Martin Lablans, Erhard Rahm
Format: Article
Language: English
Published: BMC 2021-01-01
Series: Journal of Translational Medicine
Subjects: Mainzelliste, Privacy-preserving record linkage, Blocking, Locality-sensitive hashing
Online Access: https://doi.org/10.1186/s12967-020-02678-1
author Florens Rohde
Martin Franke
Ziad Sehili
Martin Lablans
Erhard Rahm
collection DOAJ
description Abstract
Background: Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources that refer to the same person. Because these sources lack a shared unique personal identifier, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, exchanging such identifying data with a third party, as record linkage requires, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases.
Methods: We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets that differ in size and in the error rate of matching records. We compare plaintext record linkage with PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it with phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage.
Results: Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters because its error-tolerant matching algorithm can handle variations in names, in particular missing or transposed components of compound names. However, without blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality.
Conclusion: We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed very high linkage quality for both plaintext and encoded data, even in the presence of errors. The blocking methods improve runtime by orders of magnitude, facilitating the use of Mainzelliste in research projects with large datasets and many participants.
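The two techniques named in the abstract, field-level Bloom filter encoding and LSH-based blocking, can be illustrated with a minimal Python sketch. This is not the Mainzelliste implementation: the function names, the filter size of 256 bits, the 8 hash functions, the bigram padding, and the LSH key parameters are all assumptions chosen for illustration. The sketch encodes one field as a set of bit positions, compares two encodings with the Dice coefficient, and derives Hamming-LSH blocking keys by sampling fixed bit positions, so that only records sharing at least one key need to be compared.

```python
import hashlib
import random


def bigrams(value):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, size=256, num_hashes=8):
    """Encode one field as the set of bit positions set in a Bloom filter.

    size and num_hashes are illustrative assumptions, not values from the paper.
    """
    bits = set()
    for gram in bigrams(value):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(a, b):
    """Dice coefficient of two bit-position sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def lsh_keys(bits, size=256, num_keys=4, key_length=12, seed=42):
    """Hamming LSH: each blocking key is the bit pattern at a fixed sample of positions.

    The fixed seed ensures every record is sampled at the same positions.
    """
    rng = random.Random(seed)
    keys = []
    for k in range(num_keys):
        positions = rng.sample(range(size), key_length)
        pattern = "".join("1" if p in bits else "0" for p in positions)
        keys.append(f"{k}:{pattern}")
    return keys


if __name__ == "__main__":
    a, b, c = bloom_encode("Meier"), bloom_encode("Meyer"), bloom_encode("Schmidt")
    print("Meier vs Meyer:  ", round(dice(a, b), 2))
    print("Meier vs Schmidt:", round(dice(a, c), 2))
    print("shared blocking keys (Meier, Meyer):", set(lsh_keys(a)) & set(lsh_keys(b)))
```

Similar names share most bigrams and therefore most set bits, which is why similarity can be computed on the encodings without access to the plaintext. Using several short keys rather than one long key is the usual trade-off in this kind of blocking: more keys raise the chance that a true match shares a block, while longer keys shrink the blocks and reduce the number of comparisons.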
first_indexed 2024-12-14T15:24:19Z
format Article
id doaj.art-73a13189e13045bcac0c23ae4f93de1c
institution Directory Open Access Journal
issn 1479-5876
language English
last_indexed 2024-12-14T15:24:19Z
publishDate 2021-01-01
publisher BMC
record_format Article
series Journal of Translational Medicine
spelling doaj.art-73a13189e13045bcac0c23ae4f93de1c
2022-12-21T22:56:04Z
eng
BMC
Journal of Translational Medicine
1479-5876
2021-01-01
191112
10.1186/s12967-020-02678-1
Optimization of the Mainzelliste software for fast privacy-preserving record linkage
Florens Rohde (Database Group, University of Leipzig)
Martin Franke (Database Group, University of Leipzig)
Ziad Sehili (Database Group, University of Leipzig)
Martin Lablans (Federated Information Systems, German Cancer Research Center)
Erhard Rahm (Database Group, University of Leipzig)
https://doi.org/10.1186/s12967-020-02678-1
Mainzelliste
Privacy-preserving record linkage
Blocking
Locality-sensitive hashing
title Optimization of the Mainzelliste software for fast privacy-preserving record linkage
topic Mainzelliste
Privacy-preserving record linkage
Blocking
Locality-sensitive hashing
url https://doi.org/10.1186/s12967-020-02678-1