Partial Agreements in Probabilistic Linkages
Main Authors: | Adrian Brown, Sean Randall, Anna Ferrante, James Boyd |
---|---|
Format: | Article |
Language: | English |
Published: | Swansea University, 2018-09-01 |
Series: | International Journal of Population Data Science |
Online Access: | https://ijpds.org/article/view/884 |
_version_ | 1797427092398276608 |
---|---|
author | Adrian Brown Sean Randall Anna Ferrante James Boyd |
collection | DOAJ |
description | Introduction
Record linkage units around the world use probabilistic linkage techniques for routine linkage of large datasets. It is widely known how probabilities are converted to agreement and disagreement weights for each field, yet there has been little exploration of the methodology to optimally convert field similarity scores into partial weights.
Objectives and Approach
String similarity comparators such as Jaro-Winkler are commonly used in traditional linkage, while other comparators such as the Sørensen-Dice coefficient, Jaccard similarity and Hamming distance are used in alternative privacy-preserving record linkage techniques. Determining the partial weight to apply at each level of similarity is a non-trivial task. However, both types of linkage would greatly benefit from similarity-to-weight functions for each field that maximise the accuracy of the linkage.
We evaluated several methods for computing partial agreement weights and applied these to synthetic datasets with varying levels of corruption. We then evaluated the methods on real administrative datasets.
Results
Exact comparisons can miss matches where typographical errors or misspellings produce small changes in value. Similarity comparisons can reduce the number of missed matches, but may also increase the number of incorrect matches.
Results of the partial agreement methods on the Jaro-Winkler, Sørensen-Dice coefficient, Jaccard similarity and Hamming distance comparators will be presented. A generic function to convert similarity values to weights, created from synthetic data, can be used on most datasets and greatly improves linkage quality. However, maximising linkage quality requires similarity-to-weight functions that are optimised for each dataset.
Conclusion/Implications
Accuracy in record linkage is vital for the correct analysis of linked data. It is even more critical in privacy-preserving record linkage where the ability for clerical review is limited. Optimised functions for converting similarities to partial weights can significantly improve the quality of linkage and should not be overlooked. |
first_indexed | 2024-03-09T08:40:19Z |
format | Article |
id | doaj.art-d4bf2aca0fba4fba9f94c7f92e5c75ae |
institution | Directory Open Access Journal |
issn | 2399-4908 |
language | English |
last_indexed | 2024-03-09T08:40:19Z |
publishDate | 2018-09-01 |
publisher | Swansea University |
record_format | Article |
series | International Journal of Population Data Science |
spelling | doaj.art-d4bf2aca0fba4fba9f94c7f92e5c75ae; 2023-12-02T17:15:04Z; eng; Swansea University; International Journal of Population Data Science; 2399-4908; 2018-09-01; volume 3, issue 4; 10.23889/ijpds.v3i4.884; 884; Partial Agreements in Probabilistic Linkages; Adrian Brown (Curtin University), Sean Randall (Curtin University), Anna Ferrante (Curtin University), James Boyd (Curtin University); https://ijpds.org/article/view/884 |
title | Partial Agreements in Probabilistic Linkages |
url | https://ijpds.org/article/view/884 |
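The abstract describes converting field similarity scores into partial agreement weights but does not give the functions the authors evaluated. The following is only a minimal sketch of the general idea, under stated assumptions: the agreement and disagreement weights use the standard log-ratio form from m- and u-probabilities, the bigram-set versions of the Dice and Jaccard comparators are one common formulation, and `partial_weight` (with its `threshold` parameter) is a hypothetical piecewise-linear interpolation chosen for illustration, not the method from the article.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Full agreement weight, log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Full disagreement weight, log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


def bigrams(s: str) -> set[str]:
    """Set of lower-cased character bigrams (an assumed tokenisation)."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient on bigram sets: 2|X & Y| / (|X| + |Y|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings too short to form bigrams: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on bigram sets: |X & Y| / |X | Y|."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0 if a.lower() == b.lower() else 0.0
    return len(x & y) / len(x | y)


def partial_weight(sim: float, m: float, u: float, threshold: float = 0.8) -> float:
    """Hypothetical similarity-to-weight function (illustration only).

    At or below `threshold` the full disagreement weight applies; above it
    the weight rises linearly to the full agreement weight at sim == 1.
    """
    w_agree = agreement_weight(m, u)
    w_disagree = disagreement_weight(m, u)
    if sim <= threshold:
        return w_disagree
    fraction = (sim - threshold) / (1 - threshold)
    return w_disagree + fraction * (w_agree - w_disagree)
```

With these definitions, `dice("smith", "smyth")` is 0.5, because the two names share two of their four bigrams each. The threshold and the linear shape are arbitrary choices here; the abstract's conclusion is that such functions perform best when optimised for each dataset.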