Using machine learning to impute legal status of immigrants in the National Health Interview Survey

Bibliographic Details
Main Authors: Simon A. Ruhnke, Fernando A. Wilson, Jim P. Stimpson
Format: Article
Language: English
Published: Elsevier 2022-01-01
Series: MethodsX
Online Access: http://www.sciencedirect.com/science/article/pii/S221501612200228X
Description
Summary: We describe a novel machine learning method for imputing the legal status of immigrants using nationally representative survey data from the Survey of Income and Program Participation (SIPP) and the National Health Interview Survey (NHIS). Two machine learning methods, the K-nearest Neighbor (KNN) classifier and the Random Forest (RF) algorithm, are described as novel imputation methods and compared with established regression-based imputation. When the imputation methods were validated using sensitivity, specificity, positive predictive value (PPV), and accuracy statistics, the Random Forest algorithm was more accurate in identifying undocumented immigrants. Relative to regression-based imputation and KNN, it also minimized bias both in the socio-demographic variables included in the imputation and in unobserved health variables.
• We developed a new machine learning method of imputing legal status for immigrants that can be used with nationally representative, publicly available data.
• Our findings indicate that using machine learning to impute the legal status of immigrants, specifically with the Random Forest algorithm, was more accurate in identifying undocumented immigrants and minimized bias relative to other imputation methods.
ISSN:2215-0161
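
Illustrative sketch (not part of the catalog record or the article's code): the summary names a KNN classifier among the imputation methods and lists sensitivity, specificity, PPV, and accuracy as the validation statistics. The Python below shows one way these pieces could fit together, assuming a donor survey in which legal status is observed and a held-out set of records with known status for validation. The function names, the two covariates, and all data values are hypothetical. It uses a plain majority-vote KNN rather than the Random Forest the authors found most accurate, purely to keep the example dependency-free.

```python
from collections import Counter
from math import dist


def knn_impute(donor_rows, donor_labels, target_rows, k=3):
    """Give each target row the majority label of its k nearest donor rows."""
    imputed = []
    for row in target_rows:
        # Rank donor records by Euclidean distance to the target record.
        order = sorted(range(len(donor_rows)),
                       key=lambda i: dist(donor_rows[i], row))
        votes = Counter(donor_labels[i] for i in order[:k])
        imputed.append(votes.most_common(1)[0][0])
    return imputed


def validation_stats(truth, predicted, positive=1):
    """Sensitivity, specificity, PPV and accuracy from paired labels."""
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Hypothetical, pre-scaled covariates; 1 = undocumented, 0 = documented.
donor_rows = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
donor_labels = [0, 0, 0, 1, 1, 1]

# Hypothetical held-out records whose true status is known.
holdout_rows = [(0.05, 0.05), (1.0, 0.95), (0.15, 0.1), (0.95, 1.0)]
holdout_truth = [0, 1, 1, 1]

imputed = knn_impute(donor_rows, donor_labels, holdout_rows, k=3)
stats = validation_stats(holdout_truth, imputed)
print(imputed)
print(stats)
```

With binary labels, an odd k avoids tied votes. Real survey covariates would need consistent coding and scaling across the donor and target surveys before distances are meaningful.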