Haplotype frequency estimation error analysis in the presence of missing genotype data

Bibliographic Details
Main Authors: McManus Ross, Sievers Fabian, Kelly Enda D
Format: Article
Language: English
Published: BMC 2004-12-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/5/188
_version_ 1818835770931150848
author McManus Ross
Sievers Fabian
Kelly Enda D
author_sort McManus Ross
collection DOAJ
description Abstract
Background: Increasingly, researchers are turning to haplotype analysis as a tool in population studies, the investigation of linkage disequilibrium, and candidate gene analysis. When the phase of the data is unknown, computational methods, in particular those employing the Expectation-Maximisation (EM) algorithm, are frequently used to estimate the phase and frequency of the underlying haplotypes. These methods have proved very successful, predicting the phase-known frequencies with a high degree of accuracy from data for which the phase is unknown. Recently there has been much speculation about the effect of unknown, or missing, allelic data (a common phenomenon even with modern automated DNA analysis techniques) on the performance of EM-based methods. To this end, an EM-based program, modified to accommodate missing data, has been developed, incorporating non-parametric bootstrapping for the calculation of accurate confidence intervals.
Results: Here we present the results of the analyses of various data sets in which randomly selected known alleles have been relabelled as missing. Remarkably, we find that the absence of up to 30% of the data can be tolerated in both biallelic and multiallelic data sets with moderate to strong levels of linkage disequilibrium. Additionally, the frequencies of the haplotypes that predominate in the complete data analysis remain essentially the same after the addition of the random noise caused by missing data.
Conclusions: These findings have important implications for data gathering. It may be concluded that small levels of dropout in the data do not perceptibly affect the overall accuracy of haplotype analysis and that, given recent findings on the effect of inaccurate data, ambiguous data points are best treated as unknown.
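The approach summarised in the abstract (EM estimation of haplotype frequencies from unphased genotypes, with missing alleles allowed and bootstrap confidence intervals) can be illustrated with a minimal sketch. This is not the authors' program. It assumes only two biallelic loci, encodes a missing allele as `None`, and uses a percentile bootstrap, whereas the abstract specifies only that the bootstrap is non-parametric. All function names (`compatible_pairs`, `em_haplotype_frequencies`, `bootstrap_ci`) are hypothetical.

```python
import random
from itertools import product

# Hypothetical encoding: a haplotype is (allele at locus A, allele at locus B).
HAPLOTYPES = list(product((0, 1), repeat=2))


def locus_matches(observed, a1, a2):
    """True if alleles (a1, a2) fit the observed unordered pair; None is a wildcard."""
    def fits(obs, allele):
        return obs is None or obs == allele

    x, y = observed
    return (fits(x, a1) and fits(y, a2)) or (fits(x, a2) and fits(y, a1))


def compatible_pairs(genotype):
    """All ordered haplotype pairs consistent with a possibly incomplete genotype."""
    return [
        (h1, h2)
        for h1, h2 in product(HAPLOTYPES, repeat=2)
        if all(locus_matches(obs, h1[i], h2[i]) for i, obs in enumerate(genotype))
    ]


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """EM estimate of haplotype frequencies from unphased genotypes."""
    freqs = {h: 1.0 / len(HAPLOTYPES) for h in HAPLOTYPES}
    candidates = [compatible_pairs(g) for g in genotypes]
    for _ in range(n_iter):
        # E-step: split each individual over its compatible haplotype pairs,
        # weighting each pair by the product of current haplotype frequencies.
        counts = {h: 0.0 for h in HAPLOTYPES}
        for pairs in candidates:
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected counts divided by the number of chromosomes (2N).
        new = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        delta = max(abs(new[h] - freqs[h]) for h in HAPLOTYPES)
        freqs = new
        if delta < tol:
            break
    return freqs


def bootstrap_ci(genotypes, haplotype, n_boot=200, alpha=0.05, seed=0):
    """Percentile interval from resampling individuals with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        em_haplotype_frequencies(rng.choices(genotypes, k=len(genotypes)))[haplotype]
        for _ in range(n_boot)
    )
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

A genotype here is a pair of loci, each an unordered pair of alleles, for example `((0, 1), (0, None))` for an individual heterozygous at the first locus and with one allele missing at the second. Treating a missing allele as a wildcard simply enlarges the set of haplotype pairs that the E-step averages over, which is one straightforward way to let EM accommodate missing data.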
first_indexed 2024-12-19T02:56:00Z
format Article
id doaj.art-d47c2ed805fa47ea8d3e87119a1d1188
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-12-19T02:56:00Z
publishDate 2004-12-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-d47c2ed805fa47ea8d3e87119a1d1188 | 2022-12-21T20:38:20Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2004-12-01 | 5 | 1 | 188 | 10.1186/1471-2105-5-188 | Haplotype frequency estimation error analysis in the presence of missing genotype data | McManus Ross | Sievers Fabian | Kelly Enda D | http://www.biomedcentral.com/1471-2105/5/188
title Haplotype frequency estimation error analysis in the presence of missing genotype data
title_sort haplotype frequency estimation error analysis in the presence of missing genotype data
url http://www.biomedcentral.com/1471-2105/5/188
work_keys_str_mv AT mcmanusross haplotypefrequencyestimationerroranalysisinthepresenceofmissinggenotypedata
AT sieversfabian haplotypefrequencyestimationerroranalysisinthepresenceofmissinggenotypedata
AT kellyendad haplotypefrequencyestimationerroranalysisinthepresenceofmissinggenotypedata