Controlling for population structure and genotyping platform bias in the eMERGE multi-institutional biobank linked to Electronic Health Records

Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification i...

Bibliographic Details
Main Authors: David Russell Crosslin, Gerard Tromp, Amber Burt, Daniel Seung Kim, Shefali S. Verma, Anastasia M. Lucas, Yuki Bradford, Dana C. Crawford, Sebastian M. Armasu, John A. Heit, M. Geoffrey Hayes, Helena Kuivaniemi, Marylyn D. Ritchie, Gail P. Jarvik, Mariza de Andrade
Format: Article
Language: English
Published: Frontiers Media S.A. 2014-11-01
Series: Frontiers in Genetics
Subjects: Principal Component Analysis, ancestry, Biobank, genetic association study, loadings
Online Access: http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00352/full
Description: Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification is a common practice. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of a multi-ethnic cohort with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA by deriving components from subject loadings calculated from a reference sample. This alternative approach of generating principal components controlled for site and platform bias, in addition to ancestry differences, with the advantage of fewer covariates and degrees of freedom.
Keywords: principal component analysis, ancestry, biobank, loadings, genetic association study
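The reference-sample idea in the description can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes one common reading, in which loadings are estimated by PCA on a reference panel and study subjects are then projected onto them. The function names (`reference_loadings`, `project`), the 0/1/2 allele-count coding, and the simulated genotypes are all hypothetical choices made for illustration.

```python
import numpy as np

def reference_loadings(ref_genotypes, n_components):
    """Estimate PCA loadings from a reference panel (hypothetical helper).

    ref_genotypes: array of shape (n_ref_subjects, n_variants), allele counts 0/1/2.
    Returns per-variant means, per-variant scales, and a loading matrix of
    shape (n_variants, n_components).
    """
    means = ref_genotypes.mean(axis=0)
    scales = ref_genotypes.std(axis=0)
    scales[scales == 0] = 1.0  # avoid dividing by zero for monomorphic variants
    z = (ref_genotypes - means) / scales
    # Rows of vt are the principal axes (loadings) of the standardized panel.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return means, scales, vt[:n_components].T

def project(genotypes, means, scales, loadings):
    """Standardize with the reference means/scales, then project onto the loadings."""
    return ((genotypes - means) / scales) @ loadings

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
ref = rng.integers(0, 3, size=(50, 20)).astype(float)    # reference panel
study = rng.integers(0, 3, size=(10, 20)).astype(float)  # pooled study subjects

means, scales, loadings = reference_loadings(ref, n_components=3)
study_pcs = project(study, means, scales, loadings)  # one row of components per subject
```

Because the loadings are fixed by the reference panel in this sketch, they do not depend on which site or platform contributed the study subjects, and the resulting components could be entered as covariates in an association model.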
ISSN: 1664-8021
Source: Directory of Open Access Journals (DOAJ), record doaj.art-aa41694d4ac745dab6bceb338a865ca0
DOI: 10.3389/fgene.2014.00352
Volume: 5
Author affiliations:
David Russell Crosslin (University of Washington)
Gerard Tromp (Geisinger Health System)
Amber Burt (University of Washington)
Daniel Seung Kim (University of Washington)
Shefali S. Verma (Pennsylvania State University)
Anastasia M. Lucas (Pennsylvania State University)
Yuki Bradford (Pennsylvania State University)
Dana C. Crawford (Vanderbilt University)
Sebastian M. Armasu (Mayo Clinic)
John A. Heit (Mayo Clinic)
M. Geoffrey Hayes (Northwestern University)
Helena Kuivaniemi (Geisinger Health System)
Marylyn D. Ritchie (Pennsylvania State University)
Gail P. Jarvik (University of Washington)
Mariza de Andrade (Mayo Clinic)