A compelling demonstration of why traditional statistical regression models cannot be used to identify risk factors from case data on infectious diseases: a simulation study
Abstract Background Regression models are often used to explain the relative risk of infectious diseases among groups. For example, overrepresentation of immigrants among COVID-19 cases has been found in multiple countries. Several studies apply regression models to investigate whether different ris...
Main Authors: | Solveig Engebretsen, Gunnar Rø, Birgitte Freiesleben de Blasio |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2022-05-01 |
Series: | BMC Medical Research Methodology |
Subjects: | Relative risk; Communicable diseases; Infectious diseases; Regression models; Overrepresentation |
Online Access: | https://doi.org/10.1186/s12874-022-01565-1 |
_version_ | 1818548773357355008 |
---|---|
author | Solveig Engebretsen; Gunnar Rø; Birgitte Freiesleben de Blasio |
author_sort | Solveig Engebretsen |
collection | DOAJ |
description | Abstract. Background: Regression models are often used to explain the relative risk of infectious diseases among groups. For example, overrepresentation of immigrants among COVID-19 cases has been found in multiple countries. Several studies apply regression models to investigate whether different risk factors can explain this overrepresentation among immigrants without considering dependence between the cases. Methods: We study the appropriateness of traditional statistical regression methods for identifying risk factors for infectious diseases through a simulation study. We model infectious disease spread with a simple, population-structured version of the SIR (susceptible-infected-recovered) model, one of the best-known and most well-established models for infectious disease spread. The population is thus divided into different sub-groups. We vary the contact structure between the sub-groups of the population. We analyse the relation between individual-level risk of infection and group-level relative risk. We analyse whether Poisson regression estimators can capture the true, underlying parameters of transmission. We assess both the quantitative and qualitative accuracy of the estimated regression coefficients. Results: We illustrate that there is no clear relationship between differences in individual characteristics and group-level overrepresentation: small differences on the individual level can result in arbitrarily high overrepresentation. We demonstrate that individual risk of infection cannot be properly defined without simultaneous specification of the infection level of the population. We argue that the estimated regression coefficients are not interpretable and show that it is not possible to adjust for other variables by standard regression methods. Finally, we illustrate that regression models can result in the significance of variables unrelated to infection risk in the constructed simulation example (e.g. ethnicity), particularly when a large proportion of contacts is within the same group. Conclusions: Traditional regression models that are valid for modelling risk between groups for non-communicable diseases are not valid for infectious diseases. By applying such methods to identify risk factors of infectious diseases, one risks ending up with wrong conclusions. Output from such analyses should therefore be treated with great caution. |
first_indexed | 2024-12-12T08:24:39Z |
format | Article |
id | doaj.art-2b599768f71f4516b569592cec24d7da |
institution | Directory of Open Access Journals |
issn | 1471-2288 |
language | English |
last_indexed | 2024-12-12T08:24:39Z |
publishDate | 2022-05-01 |
publisher | BMC |
record_format | Article |
series | BMC Medical Research Methodology |
spelling | doaj.art-2b599768f71f4516b569592cec24d7da 2022-12-22T00:31:18Z eng BMC BMC Medical Research Methodology 1471-2288 2022-05-01 22 1 1 13 10.1186/s12874-022-01565-1 A compelling demonstration of why traditional statistical regression models cannot be used to identify risk factors from case data on infectious diseases: a simulation study Solveig Engebretsen (Norwegian Computing Center); Gunnar Rø (Department of Method Development and Analytics, Norwegian Institute of Public Health); Birgitte Freiesleben de Blasio (Department of Method Development and Analytics, Norwegian Institute of Public Health) https://doi.org/10.1186/s12874-022-01565-1 Relative risk; Communicable diseases; Infectious diseases; Regression models; Overrepresentation |
title | A compelling demonstration of why traditional statistical regression models cannot be used to identify risk factors from case data on infectious diseases: a simulation study |
topic | Relative risk; Communicable diseases; Infectious diseases; Regression models; Overrepresentation |
url | https://doi.org/10.1186/s12874-022-01565-1 |
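
The abstract describes a population-structured SIR model in which the contact structure between sub-groups is varied and the group-level relative risk is compared with individual-level differences. The following is a minimal sketch of that idea, not the authors' implementation. Several things are assumptions made for illustration only: the restriction to two equally sized groups, the Euler time stepping, the parameter values (`beta`, `gamma`, `seed_fraction`), and the function names. The log of the ratio it returns is what a Poisson regression with a single group indicator and a log-population offset would estimate from the final case counts.

```python
import math


def simulate_two_group_sir(susceptibility, within, beta=0.3, gamma=0.2,
                           days=1000.0, dt=0.05, seed_fraction=1e-4):
    """Euler-integrate a two-group SIR model with equal group sizes.

    susceptibility: relative susceptibility of each group (hypothetical values).
    within: share of each group's contacts made inside its own group.
    Returns the fraction ever infected in each group (seeded cases included).
    """
    # Row g holds the share of group g's contacts made with each group.
    contacts = [[within, 1.0 - within], [1.0 - within, within]]
    s = [1.0 - seed_fraction, 1.0 - seed_fraction]  # susceptible fractions
    i = [seed_fraction, seed_fraction]              # infectious fractions
    for _ in range(round(days / dt)):
        new_infections = []
        for g in range(2):
            # Prevalence among the people that group g is in contact with.
            prevalence_met = contacts[g][0] * i[0] + contacts[g][1] * i[1]
            new_infections.append(
                beta * susceptibility[g] * s[g] * prevalence_met * dt)
        for g in range(2):
            s[g] -= new_infections[g]
            i[g] += new_infections[g] - gamma * i[g] * dt
    return [1.0 - s[g] for g in range(2)]


def relative_risk(susceptibility, within, **kwargs):
    """Group-level relative risk: attack rate of group 0 over group 1."""
    attack = simulate_two_group_sir(susceptibility, within, **kwargs)
    return attack[0] / attack[1]


if __name__ == "__main__":
    # Same individual-level difference (20 % higher susceptibility in group 0),
    # two different contact structures.
    for within in (0.5, 0.9):
        rr = relative_risk((1.2, 1.0), within)
        print(f"within-group contact share {within}: "
              f"RR = {rr:.3f}, log RR = {math.log(rr):.3f}")
```

In this sketch the individual-level difference is held fixed while only the within-group contact share changes, yet the group-level relative risk changes with it. That is one way to see why a single regression coefficient cannot be read as an individual-level effect. Passing a shorter `days` value shows the ratio at an earlier stage of the epidemic.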