A robust clustering algorithm for identifying problematic samples in genome-wide association studies

Bibliographic Details
Main Authors: Bellenguez, C; Strange, A; Freeman, C; Wellcome Trust Case Control Consortium 2; Donnelly, P; Spencer, C
Other Authors: The International Society for Computational Biology
Format: Journal article
Language: English
Published: Oxford University Press 2012
Description: High-throughput genotyping arrays provide an efficient way to survey single nucleotide polymorphisms (SNPs) across the genome in large numbers of individuals. Downstream analysis of the data, for example in genome-wide association studies (GWAS), often involves statistical models of genotype frequencies across individuals. The complexities of the sample collection process and the potential for errors in the experimental array can lead to biases and artefacts in an individual's inferred genotypes. Rather than attempting to model these complications, it has become standard practice to remove individuals whose genome-wide data differs from the sample at large. Here we describe a simple, but robust, statistical algorithm to identify samples with atypical summaries of genome-wide variation. Its use as a semi-automated quality control tool is demonstrated using several summary statistics, selected to identify different potential problems, and it is applied to two different genotyping platforms and sample collections.
Record ID: oxford-uuid:a18401ce-7a9b-43b1-9cce-235e40300b2c
Institution: University of Oxford
Subjects: Statistics (see also social sciences); Genetics (medical sciences)