Administrative mass data linking minimal/zero false positives

ABSTRACT Objectives: This work was part of an eHealth project in Scotland to assign a health index to all electronic patient records. One-off extracts from live and historical records were posted to a record linkage department, where deterministic and Newcombe probability mass matching was performed to assign the S...


Bibliographic Details
Main Author: Chris Povey
Format: Article
Language: English
Published: Swansea University 2017-04-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/116
_version_ 1797423002920419328
author Chris Povey
collection DOAJ
description ABSTRACT Objectives: This work was part of an eHealth project in Scotland to assign a health index to all electronic patient records. One-off extracts from live and historical records were posted to a record linkage department, where deterministic and Newcombe probability mass matching was performed to assign the Scottish GP registration (CHI) number. These were real-world administrative matches, with the emphasis on minimal false positives rather than on maximum acceptable match rates. Approach: Early investigations examined the causes of false positive matching. Examining a running window of match scores for each incomer, instead of only the highest pair score, indicated that the highest-scoring pair, even with a binit score well above the acceptance threshold, could yield a spurious match, and that lower-scoring pairs for the same incomer could be more acceptable. A single threshold would therefore not work. The customers were invited to clerically check the matches; their heuristic strategies were observed and incorporated into an automated partitioning exercise. Results: Deterministic match rates below 70% were considered very poor, 70-80% poor, 80-85% average, 85-90% good, and above 90% excellent. The residual unmatched incomers were processed using Newcombe methods and then passed through the partitioning exercise. Combined (deterministic + residual) match rates of 90-95% were viewed as average, 95-98% as good, and 98-100% as excellent. Several deterministic match runs were passed through the residue process early in the exercise; any false positive this threw up prompted a change to the deterministic process to eradicate the error. Roughly 1000 linkage exercises were done for the eHealth project. Conclusions: This was a joint exercise in which the linkage department delivered potential match pairs to the customer. The customer then decided which partitions they were willing to accept. All the potential pairs were sent with a checking engine to view the outcome. Most elected to accept only deterministic matches.
Some accepted linkage department advice; often the linkers would clerically flag accepted and rejected pair matches for the customers to review. There was also a pilot administrative matching project, eCare, to assign the health index to social service data in Scotland, which started after the eHealth exercise; in both, the customers were asked to alert us to any false positives, and no alerts were received. The same methods were used in the recent exercise to de-duplicate and merge all of Glasgow's hospital records; the customer was very familiar with the methodology, so more checking work by the linker was accepted in order to achieve higher match rates. A method to estimate false positive rates is proposed.
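The grading bands and the per-incomer score window described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the linkage department's actual procedure: the function names, the tuple layout `(incomer_id, candidate_id, score)`, the default window size of 3, and the treatment of exact boundary values (for example, exactly 90% graded as "good" because the abstract reserves "excellent" for rates above 90%) are all hypothetical choices made here.

```python
import heapq
from collections import defaultdict


def grade_deterministic(rate_pct):
    """Grade a deterministic match rate (percent) using the abstract's bands.

    Boundary handling is an assumption: each band includes its lower bound,
    and "excellent" requires strictly more than 90%.
    """
    if rate_pct > 90:
        return "excellent"
    if rate_pct >= 85:
        return "good"
    if rate_pct >= 80:
        return "average"
    if rate_pct >= 70:
        return "poor"
    return "very poor"


def grade_combined(rate_pct):
    """Grade a deterministic+residual match rate (percent).

    The abstract gives no label below 90%, so None is returned there.
    """
    if rate_pct >= 98:
        return "excellent"
    if rate_pct >= 95:
        return "good"
    if rate_pct >= 90:
        return "average"
    return None


def score_window(pairs, window=3):
    """Keep the `window` highest-scoring candidates per incomer, best first.

    `pairs` is an iterable of (incomer_id, candidate_id, score) tuples
    (a hypothetical layout). Retaining several candidates, rather than only
    the top-scoring pair, lets a later partitioning or clerical step prefer
    a lower-scoring pair when the top one looks spurious.
    """
    by_incomer = defaultdict(list)
    for incomer_id, candidate_id, score in pairs:
        by_incomer[incomer_id].append((score, candidate_id))
    return {
        incomer_id: heapq.nlargest(window, scored)
        for incomer_id, scored in by_incomer.items()
    }
```

For example, `score_window(pairs, window=2)` returns, for each incomer, a list of up to two `(score, candidate_id)` tuples in descending score order; how those candidates are then partitioned into accepted and rejected matches is not specified in the abstract and is left out of this sketch.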
first_indexed 2024-03-09T07:41:08Z
format Article
id doaj.art-5465a11bac5f4e68968bb132ed7a2387
institution Directory Open Access Journal
issn 2399-4908
language English
last_indexed 2024-03-09T07:41:08Z
publishDate 2017-04-01
publisher Swansea University
record_format Article
series International Journal of Population Data Science
spelling doaj.art-5465a11bac5f4e68968bb132ed7a2387 | 2023-12-03T04:38:08Z | eng | Swansea University | International Journal of Population Data Science | 2399-4908 | 2017-04-01 | 1 | 1 | 10.23889/ijpds.v1i1.116 | 116 | Administrative mass data linking minimal/zero false positives | Chris Povey | https://ijpds.org/article/view/116
title Administrative mass data linking minimal/zero false positives
url https://ijpds.org/article/view/116