Estimating multiplicity of infection, allele frequencies, and prevalences accounting for incomplete data.

Bibliographic Details
Main Authors: Meraj Hashemi, Kristan A Schneider
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2024-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0287161
_version_ 1797242823848755200
author Meraj Hashemi
Kristan A Schneider
collection DOAJ
description Background: Molecular surveillance of infectious diseases allows the monitoring of pathogens beyond the granularity of traditional epidemiological approaches and is well established for some of the most relevant infectious diseases, such as malaria. The presence of genetically distinct pathogen variants within an infection, referred to as multiplicity of infection (MOI) or complexity of infection (COI), is common in malaria and similar infectious diseases. MOI is an important metric that scales with transmission intensity, potentially affects clinical pathogenesis, and is a confounding factor when monitoring the frequency and prevalence of pathogen variants. Several statistical methods exist to estimate MOI and the frequency distribution of pathogen variants. However, a common problem is the quality of the underlying molecular data. If molecular assays do not fail at random, MOI and the prevalence of pathogen variants are likely to be underestimated.
Methods and findings: A statistical model is introduced that explicitly addresses data quality by assuming a probability with which a pathogen variant remains undetected in a molecular assay. This differs from the assumption of missing at random, under which a molecular assay either performs perfectly or fails completely. The method is applicable to a single molecular marker and allows estimation of allele-frequency spectra, the distribution of MOI, and the probability that variants remain undetected (incomplete information). Based on the statistical model, expressions for the prevalence of pathogen variants are derived, and differences between frequency and prevalence are discussed. The usual desirable asymptotic properties of the maximum-likelihood estimator (MLE) are established by rewriting the model as an exponential family. The MLE has promising finite-sample properties in terms of bias and variance. The covariance matrix of the estimator is close to the Cramér-Rao lower bound (inverse Fisher information). Importantly, the estimator's variance is larger than that of a similar method that disregards incomplete information, but its bias is smaller.
Conclusions: Although the model introduced here has convenient properties, in terms of mean squared error it does not outperform a simple standard method that neglects missing information. Thus, the new method is recommended only for data sets in which the molecular assays produced poor-quality results. This will be particularly true if the model is extended to accommodate information from multiple molecular markers at the same time, and incomplete information at one or more markers leads to a strong depletion of sample size.
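The effect described in the abstract, that imperfect detection of variants biases naive summaries downward, can be illustrated with a small simulation. The following is a minimal sketch and not the authors' model or estimator. It assumes (none of this is stated in the record) that MOI follows a zero-truncated Poisson distribution with parameter `lam`, that each lineage draws its allele independently from the frequencies `freqs`, and that each allele present in an infection goes undetected independently with probability `eps`. The function names `sample_moi` and `simulate` are hypothetical.

```python
import math
import random


def sample_moi(lam, rng):
    """Draw MOI from a zero-truncated Poisson(lam); an assumed MOI model."""
    threshold = math.exp(-lam)
    while True:
        # Knuth's multiplication method for one Poisson draw (fine for small lam).
        k, prod = 0, rng.random()
        while prod > threshold:
            k += 1
            prod *= rng.random()
        if k > 0:  # reject zero so every infection has at least one lineage
            return k


def simulate(n, lam, freqs, eps, seed=0):
    """Return per-sample counts of distinct alleles (truly present, detected)."""
    rng = random.Random(seed)
    alleles = list(range(len(freqs)))
    true_counts, seen_counts = [], []
    for _ in range(n):
        m = sample_moi(lam, rng)
        # Distinct alleles carried by the m lineages of this infection.
        present = set(rng.choices(alleles, weights=freqs, k=m))
        # Each present allele is missed with probability eps (assumption).
        detected = {a for a in present if rng.random() >= eps}
        true_counts.append(len(present))
        seen_counts.append(len(detected))
    return true_counts, seen_counts


# Hypothetical parameter values chosen only for illustration.
true_c, seen_c = simulate(n=5000, lam=1.5, freqs=[0.5, 0.3, 0.2], eps=0.2, seed=1)
print("mean distinct alleles present: ", sum(true_c) / len(true_c))
print("mean distinct alleles detected:", sum(seen_c) / len(seen_c))
```

Because the detected alleles are always a subset of those present, the detected count can never exceed the true count, so any summary built directly on detected alleles tends to understate within-host diversity when `eps` is greater than zero. The sketch only demonstrates the direction of this bias; it does not estimate `lam`, `freqs`, or `eps` from data.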
first_indexed 2024-04-24T18:45:21Z
format Article
id doaj.art-7aa5ce3ff04d4ebbb9f8feb2b7878ee2
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-04-24T18:45:21Z
publishDate 2024-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-7aa5ce3ff04d4ebbb9f8feb2b7878ee2 2024-03-27T05:32:52Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2024-01-01 19 3 e0287161 10.1371/journal.pone.0287161 Estimating multiplicity of infection, allele frequencies, and prevalences accounting for incomplete data. Meraj Hashemi Kristan A Schneider https://doi.org/10.1371/journal.pone.0287161
title Estimating multiplicity of infection, allele frequencies, and prevalences accounting for incomplete data.
url https://doi.org/10.1371/journal.pone.0287161