Exploring Goldstein et al.’s Scalelink method of data linkage.

Objectives: Scalelink is an innovative probabilistic data linkage method based on correspondence analysis. Unlike the widely used Fellegi-Sunter algorithm, it does not assume linkage variable independence. It also claims to be more intuitive and computationally efficient. We aim to test t...

Bibliographic Details
Main Authors:	Mary Ann Megan Cleaton, Josie Plachta, Rachel Shipsey
Format:	Article
Language:	English
Published:	Swansea University 2022-08-01
Series:	International Journal of Population Data Science
Subjects:	data linkage; probabilistic data linkage; Scalelink; correspondence analysis; big data
Online Access:	https://ijpds.org/article/view/2042

_version_	1797427142892453888
author	Mary Ann Megan Cleaton, Josie Plachta, Rachel Shipsey
author_facet	Mary Ann Megan Cleaton, Josie Plachta, Rachel Shipsey
author_sort	Mary Ann Megan Cleaton
collection	DOAJ
description	Objectives: Scalelink is an innovative probabilistic data linkage method based on correspondence analysis. Unlike the widely used Fellegi-Sunter algorithm, it does not assume that linkage variables are independent. Its authors also claim that it is more intuitive and computationally efficient. We aim to test this method for the first time on real-world big data. Approach: Scalelink uses the agreement state of each linkage variable for each candidate pair. These states are compared to determine how frequently, across all candidate pairs, any given agreement state is held at the same time as any other agreement state (this accounts for dependence between variables). The results of this comparison are fed into a loss function, which is minimised subject to constraints to produce weights. Currently, the method is accessible via Goldstein et al.’s paper and R package. We are translating it into PySpark to enable testing on datasets that are too large to link without distributed computing. Results: Initial testing of Goldstein et al.’s Scalelink method on small samples of real-world datasets shows that it performs as expected for a probabilistic linkage method, although it cannot currently deal with missingness. To test the quality of the method on real-world big data, a high-quality linked dataset of the 2021 England and Wales Census and the follow-up Census Coverage Survey will be used as a Gold Standard (GS). After developing a method that enables Scalelink to deal with missingness, we will apply Scalelink and automatic Fellegi-Sunter probabilistic linkage to this GS. We can thus establish and compare the precision and recall of both methods. We will also investigate linkage bias for particular demographics, test computational efficiency and estimate the clerical review burden for each method. Conclusion: Goldstein et al.’s Scalelink algorithm shows promise as a high-quality, scalable linkage algorithm that does not assume independence between linkage variables, for use in any matching project. Here, for the first time, we research the method’s quality and feasibility with real-world big data. From this we will produce recommendations regarding its utility.
first_indexed	2024-03-09T08:40:39Z
format	Article
id	doaj.art-34090c76c6304b679521dae0b8c0b363
institution	Directory of Open Access Journals
issn	2399-4908
language	English
last_indexed	2024-03-09T08:40:39Z
publishDate	2022-08-01
publisher	Swansea University
record_format	Article
series	International Journal of Population Data Science
spelling	doaj.art-34090c76c6304b679521dae0b8c0b363; 2023-12-02T17:02:09Z; eng; Swansea University; International Journal of Population Data Science; 2399-4908; 2022-08-01; vol. 7, issue 3; 10.23889/ijpds.v7i3.2042; Exploring Goldstein et al.’s Scalelink method of data linkage.; Mary Ann Megan Cleaton, Josie Plachta, Rachel Shipsey (all Office for National Statistics); https://ijpds.org/article/view/2042; data linkage, probabilistic data linkage, Scalelink, correspondence analysis, big data
title	Exploring Goldstein et al.’s Scalelink method of data linkage.
title_sort	exploring goldstein et al s scalelink method of data linkage
topic	data linkage; probabilistic data linkage; Scalelink; correspondence analysis; big data
url	https://ijpds.org/article/view/2042
work_keys_str_mv	AT maryannmegancleaton exploringgoldsteinetalsscalelinkmethodofdatalinkage AT josieplachta exploringgoldsteinetalsscalelinkmethodofdatalinkage AT rachelshipsey exploringgoldsteinetalsscalelinkmethodofdatalinkage
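
The Approach in the description above rests on counting how often one agreement state is held at the same time as another across all candidate pairs. The following is a minimal sketch of that counting step only, not Goldstein et al.’s implementation or the authors’ PySpark translation. The variable names (forename, surname, dob), the two agreement states and the toy data are assumptions made for illustration, and the loss function and the derivation of weights are not shown.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data (an assumption, not from the article): each candidate
# pair holds one agreement state per linkage variable.
candidate_pairs = [
    {"forename": "agree", "surname": "agree", "dob": "agree"},
    {"forename": "agree", "surname": "agree", "dob": "disagree"},
    {"forename": "disagree", "surname": "disagree", "dob": "disagree"},
    {"forename": "agree", "surname": "disagree", "dob": "agree"},
]


def cooccurrence_counts(pairs):
    """Count how often two (variable, state) categories are held by the same candidate pair."""
    counts = Counter()
    for pair in pairs:
        # Sort by variable name so each pair of categories always gets the same key order.
        categories = sorted(pair.items())
        for a, b in combinations(categories, 2):
            counts[(a, b)] += 1
    return counts


counts = cooccurrence_counts(candidate_pairs)
for (a, b), n in sorted(counts.items()):
    print(a, b, n)
```

A table of this kind records the joint behaviour of the linkage variables directly. For example, it shows how often forename and surname agree together rather than treating the two agreements as independent events.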

Similar Items