Data Linkage of Hashed Data: Derive and Conquer

Introduction: Linking hashed datasets is much more difficult than linking in-the-clear data. Hashing prevents the use of matching tools that tolerate messy data, such as ‘contained-within’ functions and edit-distance metrics. Hashing sensitive data received from third parties is becoming more...

Bibliographic Details
Main Authors: Josie Plachta, Charlie Tomlin, Rachel Shipsey
Format: Article
Language: English
Published: Swansea University 2020-12-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/1447
author Josie Plachta
Charlie Tomlin
Rachel Shipsey
collection DOAJ
description Introduction: Linking hashed datasets is much more difficult than linking in-the-clear data. Hashing prevents the use of matching tools that tolerate messy data, such as ‘contained-within’ functions and edit-distance metrics. Hashing sensitive data received from third parties is becoming more common because of increased data security concerns. Institutions need to be ready to link hashed data with high accuracy; otherwise the quality of outputs from these linked datasets will suffer. Objectives and Approach: We designed an innovative matching method, Derive and Conquer (D&C). We derived variables containing substrings or patterns of the full variable (e.g. Soundex or the first 4 characters of a string) and matched on those instead. However, using many combinations of these derived variables would require thousands of traditional matchkeys to be programmed, run, and reviewed. Instead, D&C runs matchkeys on a derived agreement variable, which amalgamates the information stored in multiple derived variables into one value, reducing the number of matchkeys to a manageable amount. D&C runs on distributed computing systems using PySpark to link datasets containing millions of records in a timely manner. Results: D&C was developed using in-the-clear UK Census and health records, with results comparable to the in-the-clear gold standard. It is currently being tested on hashed data to link UK tax and benefits data to UK health records. 66.4 million records were declared matched, a realistic match rate for the UK population. Research into the linkage quality is ongoing to produce estimates of the amount of bias in the linkage and of the precision and recall. We will be excited to present these results at the Conference in October. These results will be used to improve D&C. Conclusion / Implications: Using these derived variables, we have been able to overcome the challenge of matching massive hashed datasets with a realistic match rate and in a realistic time frame.
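The approach described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python rather than PySpark, and the field names, the salt, the SHA-256 hashing step, the 0/1 pattern encoding of the agreement variable, and the accepted patterns are all assumptions made for illustration. It shows derived variables (first 4 characters, Soundex) being hashed separately, so that two hashed records can still be compared on partial information, and the comparisons being combined into a single agreement value.

```python
import hashlib

# Letter-to-digit table for standard American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name ('' if no letters)."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = _CODES.get(name[0], "")
    for ch in name[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (result + "000")[:4]


def hash_value(value: str, salt: str = "demo-salt") -> str:
    """Hash one value. The salt and the choice of SHA-256 are assumptions."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Hypothetical derived variables, in the order used by the agreement pattern.
FIELDS = ["surname_full", "surname_first4", "surname_soundex",
          "forename_full", "forename_first4", "dob"]


def derive_and_hash(record: dict) -> dict:
    """Derive variables from clear text, then hash each one separately.

    This step would run at the data supplier, before the data is shared.
    """
    surname = record["surname"].strip().upper()
    forename = record["forename"].strip().upper()
    derived = {
        "surname_full": surname,
        "surname_first4": surname[:4],
        "surname_soundex": soundex(surname),
        "forename_full": forename,
        "forename_first4": forename[:4],
        "dob": record["dob"],
    }
    return {field: hash_value(derived[field]) for field in FIELDS}


def agreement_pattern(a: dict, b: dict) -> str:
    """Combine per-field hash comparisons into one value, e.g. '011011'."""
    return "".join("1" if a[f] == b[f] else "0" for f in FIELDS)


# A hypothetical matchkey: a set of agreement patterns accepted as matches.
ACCEPTED = {"111111", "011011", "011111", "111011"}

rec_a = derive_and_hash({"surname": "Johnson", "forename": "Katherine",
                         "dob": "1980-05-17"})
rec_b = derive_and_hash({"surname": "Johnsen", "forename": "Katharine",
                         "dob": "1980-05-17"})
pattern = agreement_pattern(rec_a, rec_b)
print(pattern, pattern in ACCEPTED)
```

In this sketch the two records disagree on the hashed full surname and full forename but agree on every derived variable, so one rule on the agreement value accepts the pair; without the combined value, each such combination of derived variables would need its own matchkey.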
first_indexed 2024-03-09T09:32:30Z
format Article
id doaj.art-038c818bf20f416c8d556ed6644a0239
institution Directory Open Access Journal
issn 2399-4908
language English
last_indexed 2024-03-09T09:32:30Z
publishDate 2020-12-01
publisher Swansea University
record_format Article
series International Journal of Population Data Science
spelling doaj.art-038c818bf20f416c8d556ed6644a0239; 2023-12-02T03:15:14Z; eng; Swansea University; International Journal of Population Data Science; 2399-4908; 2020-12-01; 5; 5; 10.23889/ijpds.v5i5.1447; Data Linkage of Hashed Data: Derive and Conquer; Josie Plachta (Office for National Statistics UK); Charlie Tomlin (Office for National Statistics UK); Rachel Shipsey (Office for National Statistics UK); https://ijpds.org/article/view/1447
title Data Linkage of Hashed Data: Derive and Conquer
url https://ijpds.org/article/view/1447