Using Logistic Regression to Estimate the False Positive Rate in The IDI (SoLinks)

Bibliographic Details
Main Authors: Anna Lin, Soon Song, Nancy Wang
Format: Article
Language: English
Published: Swansea University 2020-12-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/1484
collection DOAJ
description Introduction: Stats NZ’s Integrated Data Infrastructure (IDI) is a linked longitudinal database combining administrative and survey data. Previously, false positive (FP) linkages in the IDI were assessed by clerical review of a sample of linked records, which was time-consuming and subject to inconsistency.
Objectives and Approach: A modelled approach, ‘SoLinks’, has been developed to automate the FP estimation process for the IDI. It uses a logistic regression model to calculate the probability that a given link is a true match. The model is based on the agreement types defined for four key linking variables – first name, last name, sex, and date of birth. Some specific types of links that we believe to be high-quality true matches are exempted from the model and given pre-defined probabilities. The training data used to estimate the model parameters was based on the outcomes of the clerical review process over several years.
Results: We have compared the FP rates estimated through clerical review with those estimated through the SoLinks model. Some SoLinks estimates fall outside the 95% confidence intervals of the clerically reviewed ones. This may be because the pre-defined probabilities for the exempted types of links are too high.
Conclusion: The automation of FP checking has saved analyst time and resource. The modelled FP estimates have been more stable across time than the previous clerical reviews. As this model estimates the probability of a true match at the individual link level, we may provide this probability to researchers so that they can calculate linkage quality indicators for their research populations.
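The approach described in the abstract can be illustrated with a minimal Python sketch. It is not the SoLinks implementation: the agreement levels ("exact", "partial"), the intercept, the coefficients, and the pre-defined probability for exempted links are all hypothetical values chosen for illustration, since the record does not publish the fitted model. The sketch scores each link with a logistic function of its agreement pattern and estimates the FP rate as the average of one minus the true-match probability.

```python
import math

# All numbers below are hypothetical placeholders, NOT the fitted SoLinks
# parameters. Keys are (linking variable, agreement level); any level not
# listed (for example "disagree") contributes 0 to the log-odds.
INTERCEPT = -4.0
COEFFICIENTS = {
    ("first_name", "exact"): 3.0,
    ("first_name", "partial"): 1.5,
    ("last_name", "exact"): 3.0,
    ("last_name", "partial"): 1.5,
    ("sex", "exact"): 1.0,
    ("dob", "exact"): 4.0,
    ("dob", "partial"): 2.0,
}
# Hypothetical pre-defined probability for exempted link types.
EXEMPT_PROBABILITY = 0.999


def true_match_probability(link):
    """Return the modelled probability that one link is a true match."""
    if link.get("exempt", False):
        # Exempted links bypass the model and get the pre-defined value.
        return EXEMPT_PROBABILITY
    log_odds = INTERCEPT
    for variable, level in link["agreement"].items():
        log_odds += COEFFICIENTS.get((variable, level), 0.0)
    # Logistic (inverse logit) transform of the linear predictor.
    return 1.0 / (1.0 + math.exp(-log_odds))


def estimated_fp_rate(links):
    """Estimate the FP rate as the mean of (1 - P(true match))."""
    if not links:
        raise ValueError("at least one link is required")
    total = sum(1.0 - true_match_probability(link) for link in links)
    return total / len(links)


# A small made-up research population of three links.
links = [
    {"agreement": {"first_name": "exact", "last_name": "exact",
                   "sex": "exact", "dob": "exact"}},
    {"agreement": {"first_name": "partial", "last_name": "exact",
                   "sex": "exact", "dob": "partial"}},
    {"exempt": True},
]

if __name__ == "__main__":
    for link in links:
        print(round(true_match_probability(link), 4))
    print("estimated FP rate:", round(estimated_fp_rate(links), 4))
```

Because the probability is attached to each link, the same averaging can be applied to any subset of links, which is how a researcher could derive a quality indicator for a specific research population. In practice the coefficients would be estimated from clerically reviewed links rather than set by hand.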
id doaj.art-b89eeb911bb14e538c770a779160f837
institution Directory Open Access Journal
issn 2399-4908
doi 10.23889/ijpds.v5i5.1484
volume 5
issue 5
author_affiliation Anna Lin: Statistics New Zealand
Soon Song: Formerly worked at Statistics New Zealand
Nancy Wang: Formerly worked at Statistics New Zealand