Utilising identifier error variation in linkage of large administrative data sources

Abstract. Background: Linkage of administrative data sources often relies on probabilistic methods using a set of common identifiers (e.g. sex, date of birth, postcode). Variation in data quality on an individual or organisational level (e.g. by hospital) can result in clustering of identifier errors,...


Bibliographic Details
Main Authors:	Katie Harron, Gareth Hagger-Johnson, Ruth Gilbert, Harvey Goldstein
Format:	Article
Language:	English
Published:	BMC 2017-02-01
Series:	BMC Medical Research Methodology
Subjects:	Data linkage, Record linkage, Administrative data, Linkage error, Linkage evaluation, Hospital admission
Online Access:	http://link.springer.com/article/10.1186/s12874-017-0306-8

_version_	1818548109192462336
author	Katie Harron, Gareth Hagger-Johnson, Ruth Gilbert, Harvey Goldstein
author_sort	Katie Harron
collection	DOAJ
description	Abstract. Background: Linkage of administrative data sources often relies on probabilistic methods using a set of common identifiers (e.g. sex, date of birth, postcode). Variation in data quality on an individual or organisational level (e.g. by hospital) can result in clustering of identifier errors, violating the assumption of independence between identifiers required for traditional probabilistic match weight estimation. This potentially introduces selection bias to the resulting linked dataset. We aimed to measure variation in identifier error rates in a large English administrative data source (Hospital Episode Statistics; HES) and to incorporate this information into match weight calculation. Methods: We used 30,000 randomly selected HES hospital admissions records of patients aged 0–1, 5–6 and 18–19 years, for 2011/2012, linked via NHS number with data from the Personal Demographic Service (PDS; our gold standard). We calculated identifier error rates for sex, date of birth and postcode and used multi-level logistic regression to investigate associations with individual-level attributes (age, ethnicity, and gender) and organisational variation. We then derived: i) weights incorporating dependence between identifiers; ii) attribute-specific weights (varying by age, ethnicity and gender); and iii) organisation-specific weights (by hospital). Results were compared with traditional match weights using a simulation study. Results: Identifier errors (where values disagreed in linked HES-PDS records) or missing values were found in 0.11% of records for sex and date of birth and in 53% of records for postcode. Identifier error rates differed significantly by age, ethnicity and sex (p < 0.0005). Errors were less frequent in males, in 5–6 year olds and 18–19 year olds compared with infants, and were lowest for the Asian ethnic group. A simulation study demonstrated that substantial bias was introduced into estimated readmission rates in the presence of identifier errors. Attribute- and organisation-specific weights reduced this bias compared with weights estimated using traditional probabilistic matching algorithms. Conclusions: We provide empirical evidence on variation in rates of identifier error in a widely used administrative data source and propose a new method for deriving match weights that incorporates additional data attributes. Our results demonstrate that incorporating information on variation by individual-level characteristics can help to reduce bias due to linkage error.
first_indexed	2024-12-12T08:15:33Z
format	Article
id	doaj.art-83fb11d307004f11af251a60bdd8083a
institution	Directory Open Access Journal
issn	1471-2288
language	English
last_indexed	2024-12-12T08:15:33Z
publishDate	2017-02-01
publisher	BMC
record_format	Article
series	BMC Medical Research Methodology
spelling	doaj.art-83fb11d307004f11af251a60bdd8083a; 2022-12-22T00:31:37Z; eng; BMC; BMC Medical Research Methodology; 1471-2288; 2017-02-01; 17119; 10.1186/s12874-017-0306-8; Utilising identifier error variation in linkage of large administrative data sources; Katie Harron (London School of Hygiene and Tropical Medicine); Gareth Hagger-Johnson (Administrative Data Research Centre for England, UCL); Ruth Gilbert (Administrative Data Research Centre for England and UCL Great Ormond Street Institute of Child Health); Harvey Goldstein (University of Bristol, Administrative Data Research Centre for England and UCL Great Ormond Street Institute of Child Health); http://link.springer.com/article/10.1186/s12874-017-0306-8
title	Utilising identifier error variation in linkage of large administrative data sources
topic	Data linkage, Record linkage, Administrative data, Linkage error, Linkage evaluation, Hospital admission
url	http://link.springer.com/article/10.1186/s12874-017-0306-8


Similar Items