Administrative mass data linking minimal/zero false positives

Bibliographic Details
Main Author: Chris Povey
Format: Article
Language: English
Published: Swansea University 2017-04-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/116
Description
Summary: ABSTRACT
Objectives: This work was part of an eHealth project in Scotland to assign a health index to all electronic patient records. One-off extracts from live and historical records were sent to a record linkage department, where deterministic matching and Newcombe probabilistic mass matching were used to assign the Scottish GP registration (CHI) number. These were real-world administrative matches, with the emphasis on minimal false positives rather than on the highest acceptable match rate.
Approach: Early investigations examined the causes of false positive matching. Looking at a running window of match scores for each incomer, instead of only the highest-scoring pair, indicated that the highest pair Binit scores, even those well above the acceptance threshold, could yield spurious matches, and that lower-scoring pairs for the same incomer could be more acceptable. A single threshold would therefore not work. The customers were invited to check the matches clerically; their heuristic strategies were observed and incorporated into an automated partitioning exercise.
Results: Deterministic match rates below 70% were considered very poor, 70-80% poor, 80-85% average, 85-90% good, and above 90% excellent. The residual unmatched incomers were processed using Newcombe methods and then passed through the partitioning exercise. Combined (deterministic plus residual) match rates of 90-95% were viewed as average, 95-98% as good, and 98-100% as excellent. Early in the exercise, several deterministic match runs were also passed through the residue process; any false positive this exposed led to a change in the deterministic process to eliminate the error. Roughly 1000 linkage exercises were carried out for the eHealth project.
Conclusions: This was a joint exercise in which the linkage department delivered potential match pairs to the customer, who then decided which partitions they were willing to accept. All potential pairs were sent with a checking engine for viewing the outcome. Most customers elected to accept only deterministic matches. Some accepted the linkage department's advice; often the linkers would clerically flag accepted and rejected pair matches for the customers to review. A pilot administrative matching project called eCare, which assigned the health index to social service data in Scotland, started after the eHealth exercise. In both projects the customers were asked to alert us to any false positives; no alerts were received. The same methods were used in the recent exercise to de-duplicate and merge all of Glasgow's hospital records; because the customer was very familiar with the methodology, more checking work by the linker was accepted in order to achieve higher match rates. A method to estimate false positive rates is proposed.
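The rating bands and the running window of candidate scores described in the summary can be illustrated with a short Python sketch. This is not the author's implementation: the function names, the tuple layout, the window size of 3, the "unrated" label, and the treatment of each band's lower bound as inclusive are all assumptions made for illustration. The abstract supplies only the bands themselves and the observation that the top-scoring pair alone is not reliable.

```python
from collections import defaultdict

# Rating bands quoted in the abstract, as (lower bound in %, label).
# Treating each lower bound as inclusive is an assumption; the abstract
# does not say how boundary values such as exactly 90% were rated.
DETERMINISTIC_BANDS = [(90.0, "excellent"), (85.0, "good"),
                       (80.0, "average"), (70.0, "poor")]
COMBINED_BANDS = [(98.0, "excellent"), (95.0, "good"), (90.0, "average")]


def rate_band(match_rate_pct, bands, below_label):
    """Return the label of the first band whose lower bound is met."""
    for lower, label in bands:
        if match_rate_pct >= lower:
            return label
    return below_label


def candidate_windows(scored_pairs, window=3):
    """Keep the top `window` scoring candidates for each incomer.

    `scored_pairs` is an iterable of (incomer_id, candidate_id, score)
    tuples; this layout is hypothetical. Keeping several candidates per
    incomer, rather than only the best one, lets a later rule or a
    clerical reviewer prefer a lower-scoring pair.
    """
    by_incomer = defaultdict(list)
    for incomer_id, candidate_id, score in scored_pairs:
        by_incomer[incomer_id].append((score, candidate_id))
    return {
        incomer_id: sorted(cands, key=lambda c: c[0], reverse=True)[:window]
        for incomer_id, cands in by_incomer.items()
    }


if __name__ == "__main__":
    print(rate_band(92.0, DETERMINISTIC_BANDS, "very poor"))
    print(rate_band(96.5, COMBINED_BANDS, "unrated"))
    pairs = [("a", "x", 10.0), ("a", "y", 25.0), ("a", "z", 18.0)]
    print(candidate_windows(pairs, window=2))
```

The window is kept per incomer so that any partitioning rule applied afterwards can compare the candidates for one incoming record side by side, instead of applying a single score threshold to the best pair.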
ISSN:2399-4908