Survival analysis under imperfect record linkage using historic census data


Bibliographic Details
Main Authors: Arielle K. Marks-Anglin, Frances K. Barg, Michelle Ross, Douglas J. Wiebe, Wei-Ting Hwang
Format: Article
Language: English
Published: BMC 2024-03-01
Series: BMC Medical Research Methodology
Subjects: Census data, Censoring, Missing data, Record linkage, Survival analysis
Online Access: https://doi.org/10.1186/s12874-024-02194-6
author Arielle K. Marks-Anglin
Frances K. Barg
Michelle Ross
Douglas J. Wiebe
Wei-Ting Hwang
collection DOAJ
description Abstract
Background: Advancements in linking publicly available census records with vital and administrative records have enabled novel investigations in epidemiology and social history. However, in the absence of unique identifiers, the linkage of the records may be uncertain or only be successful for a subset of the census cohort, resulting in missing data. For survival analysis, differential ascertainment of event times can impact inference on risk associations and median survival.
Methods: We modify some existing approaches that are commonly used to handle missing survival times to accommodate this imperfect linkage situation, including complete case analysis, censoring, weighting, and several multiple imputation methods. We then conduct simulation studies to compare the performance of the proposed approaches in estimating the associations of a risk factor or exposure in terms of hazard ratio (HR) and median survival times in the presence of missing survival times. The effects of different missing data mechanisms and exposure-survival associations on their performance are also explored. The approaches are applied to a historic cohort of residents in Ambler, PA, established using the 1930 US census, from which only 2,440 out of 4,514 individuals (54%) had death records retrievable from publicly available data sources and death certificates. Using this cohort, we examine the effects of occupational and paraoccupational asbestos exposure on survival and disparities in mortality by race and gender.
Results: We show that imputation based on conditional survival results in less bias and greater efficiency relative to a complete case analysis when estimating log-hazard ratios and median survival times. When the approaches are applied to the Ambler cohort, we find a significant association between occupational exposure and mortality, particularly among black individuals and males, but not between paraoccupational exposure and mortality.
Discussion: This investigation illustrates the strengths and weaknesses of different imputation methods for missing survival times due to imperfect linkage of the administrative or registry data. The performance of the methods may depend on the missingness process as well as the parameter being estimated and models of interest, and such factors should be considered when choosing the methods to address the missing event times.
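The abstract's point that differential ascertainment of event times can distort median survival can be illustrated with a toy simulation. The sketch below is not the authors' simulation design: the exponential survival model, the hazard of 0.02 per year, and the linkage probability that rises with survival time are all hypothetical choices made only for illustration. It compares the median over the full cohort with the median from a complete case analysis restricted to individuals whose death record was "linked".

```python
import math
import random
import statistics


def simulate_cohort(n=20000, hazard=0.02, seed=0):
    """Simulate (survival_time, linked) pairs for a hypothetical cohort.

    Assumptions (illustrative only, not taken from the article):
    - survival times are exponential with the given hazard;
    - the chance that a death record can be linked grows with
      survival time, so early deaths are under-ascertained.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        t = rng.expovariate(hazard)
        p_link = 1.0 - math.exp(-t / 40.0)  # hypothetical linkage model
        linked = rng.random() < p_link
        cohort.append((t, linked))
    return cohort


def medians(cohort):
    """Return (full-cohort median, complete-case median)."""
    all_times = [t for t, _ in cohort]
    linked_times = [t for t, linked in cohort if linked]
    return statistics.median(all_times), statistics.median(linked_times)


if __name__ == "__main__":
    cohort = simulate_cohort()
    full_median, cc_median = medians(cohort)
    linked_fraction = sum(linked for _, linked in cohort) / len(cohort)
    print(f"linked fraction:      {linked_fraction:.2f}")
    print(f"full-cohort median:   {full_median:.1f}")
    print(f"complete-case median: {cc_median:.1f}")
```

Under these assumed settings the complete-case median overstates survival, because the individuals with missing event times are disproportionately those who died early. A different missingness mechanism could bias the estimate in the other direction or not at all, which is consistent with the abstract's remark that performance depends on the missingness process.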
first_indexed 2024-04-24T23:06:07Z
format Article
id doaj.art-954b1a4a51064e18b0007f8645923faf
institution Directory Open Access Journal
issn 1471-2288
language English
last_indexed 2024-04-24T23:06:07Z
publishDate 2024-03-01
publisher BMC
record_format Article
series BMC Medical Research Methodology
spelling doaj.art-954b1a4a51064e18b0007f8645923faf 2024-03-17T12:29:47Z eng BMC BMC Medical Research Methodology 1471-2288 2024-03-01 24 1 1 16 10.1186/s12874-024-02194-6
Survival analysis under imperfect record linkage using historic census data
Arielle K. Marks-Anglin, Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania
Frances K. Barg, Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania
Michelle Ross, Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania
Douglas J. Wiebe, Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania
Wei-Ting Hwang, Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania
https://doi.org/10.1186/s12874-024-02194-6
Census data
Censoring
Missing data
Record linkage
Survival analysis
title Survival analysis under imperfect record linkage using historic census data
topic Census data
Censoring
Missing data
Record linkage
Survival analysis
url https://doi.org/10.1186/s12874-024-02194-6