Partial Agreements in Probabilistic Linkages

Introduction Record linkage units around the world use probabilistic linkage techniques for routine linkage of large datasets. It is widely known how probabilities are converted to agreement and disagreement weights for each field, yet there has been little exploration of the methodology to optimall...

Bibliographic Details
Main Authors: Adrian Brown, Sean Randall, Anna Ferrante, James Boyd
Format: Article
Language: English
Published: Swansea University 2018-09-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/884
collection DOAJ
description Introduction: Record linkage units around the world use probabilistic linkage techniques for routine linkage of large datasets. It is widely known how probabilities are converted to agreement and disagreement weights for each field, yet there has been little exploration of how best to convert field similarity scores into partial weights.
Objectives and Approach: String similarity comparators such as Jaro-Winkler are commonly used in traditional linkage; other comparators, such as the Sørensen-Dice coefficient, Jaccard similarity and Hamming distance, are used in alternative privacy-preserving record linkage techniques. Determining the partial weight to apply at each level of similarity is a non-trivial task. However, both types of linkage would benefit greatly from similarity-to-weight functions for each field that maximise the accuracy of the linkage. We evaluated several methods for computing partial agreement weights and applied these to synthetic datasets with varying levels of corruption. We then evaluated the methods on real administrative datasets.
Results: Exact comparisons can miss matches where typographical errors or misspellings produce small changes in value. Similarity comparisons can reduce the number of missed matches, but may also increase the number of incorrect matches. Various results of the partial agreement methods on the Jaro-Winkler, Sørensen-Dice coefficient, Jaccard similarity and Hamming distance comparators will be presented. A generic function to convert similarity values to weights, created from synthetic data, can be used on most datasets and greatly improves linkage quality. However, maximising linkage quality requires similarity-to-weight functions that are optimised for each dataset.
Conclusion/Implications: Accuracy in record linkage is vital for the correct analysis of linked data. It is even more critical in privacy-preserving record linkage, where the scope for clerical review is limited. Optimised functions for converting similarities to partial weights can significantly improve the quality of linkage and should not be overlooked.
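The abstract does not describe the partial-weight methods that were evaluated, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the standard Fellegi-Sunter log2 weights for full agreement and disagreement, bigram-based Sørensen-Dice and Jaccard comparators, and a simple linear interpolation between the two weights above a similarity threshold. The function names, the m and u values and the 0.8 threshold are all hypothetical and chosen for illustration.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Fellegi-Sunter weight when a field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Fellegi-Sunter weight when a field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


def bigrams(value: str) -> set[str]:
    """Set of adjacent character pairs in a lower-cased string."""
    s = value.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Sørensen-Dice coefficient on bigram sets: 2|X & Y| / (|X| + |Y|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # two strings with no bigrams are treated as identical
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on bigram sets: |X & Y| / |X | Y|."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def partial_weight(sim: float, m: float, u: float, threshold: float = 0.8) -> float:
    """Hypothetical similarity-to-weight function (an assumption, not the paper's).

    At or below the threshold the full disagreement weight applies; above it
    the weight rises linearly to the full agreement weight at sim == 1.0.
    """
    w_agree = agreement_weight(m, u)
    w_disagree = disagreement_weight(m, u)
    if sim <= threshold:
        return w_disagree
    fraction = (sim - threshold) / (1.0 - threshold)
    return w_disagree + fraction * (w_agree - w_disagree)


if __name__ == "__main__":
    m, u = 0.9, 0.1  # illustrative probabilities only
    for sim in (0.5, 0.8, 0.9, 1.0):
        print(f"similarity {sim:.2f} -> weight {partial_weight(sim, m, u):+.3f}")
    print("dice:", dice("smith", "smyth"), "jaccard:", jaccard("smith", "smyth"))
```

A linear ramp is only one possible shape. The abstract's point is that the shape of this function, and whether it is fitted generically or per dataset, affects linkage quality, so the threshold and the interpolation would in practice be tuned against data with known match status.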
first_indexed 2024-03-09T08:40:19Z
id doaj.art-d4bf2aca0fba4fba9f94c7f92e5c75ae
institution Directory Open Access Journal
issn 2399-4908
last_indexed 2024-03-09T08:40:19Z
spelling doaj.art-d4bf2aca0fba4fba9f94c7f92e5c75ae; 2023-12-02T17:15:04Z; eng; Swansea University; International Journal of Population Data Science; ISSN 2399-4908; 2018-09-01; vol. 3, no. 4; doi:10.23889/ijpds.v3i4.884; article 884; Partial Agreements in Probabilistic Linkages; Adrian Brown, Sean Randall, Anna Ferrante, James Boyd (all Curtin University); https://ijpds.org/article/view/884