Exploring Goldstein et al.’s Scalelink method of data linkage.

Bibliographic Details
Main Authors: Mary Ann Megan Cleaton, Josie Plachta, Rachel Shipsey
Format: Article
Language: English
Published: Swansea University 2022-08-01
Series: International Journal of Population Data Science
Subjects: data linkage; probabilistic data linkage; Scalelink; correspondence analysis; big data
Online Access: https://ijpds.org/article/view/2042
collection DOAJ
description Objectives: Scalelink is an innovative probabilistic data linkage method based on correspondence analysis. Unlike the widely used Fellegi-Sunter algorithm, it does not assume that linkage variables are independent. It is also claimed to be more intuitive and computationally efficient. We aim to test this method for the first time on real-world big data.

Approach: Scalelink uses the agreement state of each linkage variable in each candidate pair. These states are compared to determine how frequently, across all candidate pairs, any given agreement state is held at the same time as any other agreement state; this is how the method accounts for dependence between variables. The results of this comparison are input into a loss function, which is minimised subject to constraints to produce weights. Currently, the method is accessible via Goldstein et al.’s paper and R package. We are translating it into PySpark to enable testing on datasets that are too large to link without distributed computing.

Results: Initial testing of Goldstein et al.’s Scalelink method on small samples of real-world datasets shows that it performs as expected for a probabilistic linkage method, although it cannot currently deal with missing data. To test the quality of the method on real-world big data, a high-quality linked dataset of the 2021 England and Wales Census and the follow-up Census Coverage Survey will be used as a Gold Standard (GS). After developing a method that enables Scalelink to deal with missing data, we will apply Scalelink and automatic Fellegi-Sunter probabilistic linkage to this GS. We can thus establish and compare the precision and recall of both methods. We will also investigate linkage bias for particular demographic groups, test computational efficiency and estimate the clerical review burden for each method.

Conclusion: Goldstein et al.’s Scalelink algorithm shows promise as a high-quality, scalable linkage algorithm that does not rely on the independence assumption, for use in any matching project. Here, for the first time, we research the method’s quality and feasibility with real-world big data. From this we will produce recommendations regarding its utility.
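
The tabulation step described under Approach can be illustrated with a small sketch. It counts, across candidate pairs, how often one agreement state is held together with another; in multiple correspondence analysis this kind of table is commonly called a Burt table. The variable names, the state labels ("agree", "partial", "disagree") and the function name cooccurrence_counts are hypothetical and chosen only for illustration. This is not the Scalelink R package or the PySpark translation, and the loss function and constraints that turn the counts into weights are not reproduced because the abstract does not specify them.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: one dict per candidate pair, mapping each linkage
# variable to its agreement state. Names and states are illustrative only.
candidate_pairs = [
    {"forename": "agree", "surname": "agree", "dob": "agree"},
    {"forename": "agree", "surname": "agree", "dob": "disagree"},
    {"forename": "disagree", "surname": "disagree", "dob": "disagree"},
    {"forename": "partial", "surname": "agree", "dob": "agree"},
]


def cooccurrence_counts(pairs):
    """Count how often two (variable, state) categories occur in the same candidate pair.

    Diagonal entries (a category paired with itself) hold the number of
    candidate pairs in which that category occurs at all.
    """
    counts = Counter()
    for states in pairs:
        categories = sorted(states.items())  # [(variable, state), ...]
        for category in categories:
            counts[(category, category)] += 1
        for first, second in combinations(categories, 2):
            # Store both orderings so the table is symmetric.
            counts[(first, second)] += 1
            counts[(second, first)] += 1
    return counts


counts = cooccurrence_counts(candidate_pairs)

# Example lookup: how often do forename and surname agree in the same pair?
both_names_agree = counts[(("forename", "agree"), ("surname", "agree"))]
print("forename and surname both agree:", both_names_agree)
```

Off-diagonal counts that are much higher or lower than the marginal frequencies would suggest are what signal dependence between linkage variables. On data too large for one machine the same table would have to be built with a distributed group-and-count rather than an in-memory loop.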
first_indexed 2024-03-09T08:40:39Z
id doaj.art-34090c76c6304b679521dae0b8c0b363
institution Directory Open Access Journal
issn 2399-4908
last_indexed 2024-03-09T08:40:39Z
affiliation Office for National Statistics (all three authors)
volume 7
issue 3
doi 10.23889/ijpds.v7i3.2042
topic data linkage
probabilistic data linkage
Scalelink
correspondence analysis
big data