Linking of the 2021 Census to massive linked administrative data to understand coverage and quality
The Office for National Statistics has built a vast composite dataset for population statistics, known as the Demographic Index (DI), by linking data from health, education, and employment sources. It attempts to contain a record ‘cluster’ for each person in England and Wales. To understand the cov...
Main Authors: | Josie Plachta, Sarah Collyer |
---|---|
Format: | Article |
Language: | English |
Published: | Swansea University 2023-09-01 |
Series: | International Journal of Population Data Science |
Online Access: | https://ijpds.org/article/view/2201 |
_version_ | 1827610725836652544 |
---|---|
author | Josie Plachta Sarah Collyer |
author_facet | Josie Plachta Sarah Collyer |
author_sort | Josie Plachta |
collection | DOAJ |
description |
The Office for National Statistics (ONS) has built a vast composite dataset for population statistics, known as the Demographic Index (DI), by linking data from health, education, and employment sources. It attempts to contain a record ‘cluster’ for each person in England and Wales. To understand the coverage and quality of the DI, it has been linked to the 2021 Census to a high standard, enabling review of persons captured incorrectly, as well as of over- and under-coverage.
Massive data techniques were used to apply deterministic, probabilistic, and associative methods. High quality was achieved by applying clerical matching methods to cases that could not be confirmed by automatic techniques. Due to resource limitations, only a subsample was linked to this high standard. The resulting links were flagged to indicate cases where the DI had correctly captured persons or had made errors. Errors included capturing persons at the wrong address, accidentally splitting a person’s records across two clusters, or incorrectly capturing two persons in the same cluster. Unlinked records were flagged as under-coverage (census records with no DI link) or over-coverage (DI records with no census link).
The 2021 Census was linked to the DI with an estimated precision of 99.4%-99.7% and recall of 99.1%-99.7%. This exceptional quality allows ONS analysts to use this dataset with high confidence in analysing the quality of the DI and its impact on statistics. In general, DI under-coverage was low, with 0.9% of census records in the subsample not present on the DI. However, DI over-coverage was much higher, with 29.5% of DI records in the subsample not present on the census. Of the census persons in the subsample, 2.3% had been incorrectly split across multiple clusters, and 0.3% had been merged into a cluster with multiple other persons.
The ONS successfully linked the 2021 Census to the DI to a high quality. The linkage suggests that the DI captures most of the current population correctly but also captures many persons who are not part of it. These insights must be considered by any users of the data.
|
first_indexed | 2024-03-09T07:54:29Z |
format | Article |
id | doaj.art-4b39df59767c47ba9087cca430b1c573 |
institution | Directory Open Access Journal |
issn | 2399-4908 |
language | English |
last_indexed | 2024-03-09T07:54:29Z |
publishDate | 2023-09-01 |
publisher | Swansea University |
record_format | Article |
series | International Journal of Population Data Science |
spelling | doaj.art-4b39df59767c47ba9087cca430b1c573; 2023-12-03T01:11:27Z; eng; Swansea University; International Journal of Population Data Science; 2399-4908; 2023-09-01; volume 8, issue 2; 10.23889/ijpds.v8i2.2201; Linking of the 2021 Census to massive linked administrative data to understand coverage and quality; Josie Plachta (Office for National Statistics, Newport, United Kingdom); Sarah Collyer (Office for National Statistics, Newport, United Kingdom); https://ijpds.org/article/view/2201 |
spellingShingle | Josie Plachta Sarah Collyer Linking of the 2021 Census to massive linked administrative data to understand coverage and quality International Journal of Population Data Science |
title | Linking of the 2021 Census to massive linked administrative data to understand coverage and quality |
title_full | Linking of the 2021 Census to massive linked administrative data to understand coverage and quality |
title_fullStr | Linking of the 2021 Census to massive linked administrative data to understand coverage and quality |
title_full_unstemmed | Linking of the 2021 Census to massive linked administrative data to understand coverage and quality |
title_short | Linking of the 2021 Census to massive linked administrative data to understand coverage and quality |
title_sort | linking of the 2021 census to massive linked administrative data to understand coverage and quality |
url | https://ijpds.org/article/view/2201 |
work_keys_str_mv | AT josieplachta linkingofthe2021censustomassivelinkedadministrativedatatounderstandcoverageandquality AT sarahcollyer linkingofthe2021censustomassivelinkedadministrativedatatounderstandcoverageandquality |
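
The coverage and quality measures reported in the description can be illustrated with a small sketch. The record does not describe how the ONS computed them, so everything below is an assumption made for illustration: the function names, the identifier format, and the representation of accepted links as `(census_id, di_cluster_id)` pairs are all hypothetical. Precision and recall use the standard definitions TP / (TP + FP) and TP / (TP + FN).

```python
from collections import defaultdict


def coverage_summary(census_ids, di_ids, links):
    """Summarise linkage outcomes as proportions (hypothetical helper).

    census_ids: set of census person identifiers in the subsample
    di_ids:     set of DI cluster identifiers in the subsample
    links:      iterable of (census_id, di_cluster_id) accepted links
    """
    clusters_per_person = defaultdict(set)
    persons_per_cluster = defaultdict(set)
    for census_id, cluster_id in links:
        clusters_per_person[census_id].add(cluster_id)
        persons_per_cluster[cluster_id].add(census_id)

    # Census records with no DI link: DI under-coverage.
    unlinked_census = census_ids - set(clusters_per_person)
    # DI records with no census link: DI over-coverage.
    unlinked_di = di_ids - set(persons_per_cluster)
    # One census person linked to more than one cluster: split.
    split = {p for p, cs in clusters_per_person.items() if len(cs) > 1}
    # Census persons sharing a cluster with someone else: merged.
    merged = set()
    for persons in persons_per_cluster.values():
        if len(persons) > 1:
            merged |= persons

    return {
        "under_coverage": len(unlinked_census) / len(census_ids),
        "over_coverage": len(unlinked_di) / len(di_ids),
        "split": len(split) / len(census_ids),
        "merged": len(merged) / len(census_ids),
    }


def precision_recall(true_pos, false_pos, false_neg):
    """Standard precision and recall from counts of reviewed links."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Toy data, invented for illustration only.
census = {"c1", "c2", "c3", "c4", "c5"}
di = {"d1", "d2", "d3", "d4", "d5"}
accepted = [("c1", "d1"), ("c2", "d2"), ("c2", "d3"),
            ("c3", "d4"), ("c4", "d4")]
summary = coverage_summary(census, di, accepted)
```

In this toy example, `c5` is unlinked (under-coverage), `d5` is unlinked (over-coverage), `c2` is split across `d2` and `d3`, and `c3` and `c4` are merged into `d4`. Note that the two coverage rates use different denominators, census records for under-coverage and DI records for over-coverage, matching how the description words its percentages.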