Controlling for population structure and genotyping platform bias in the eMERGE multi-institutional biobank linked to Electronic Health Records
Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification i...
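Main Authors: | David Russell Crosslin, Gerard Tromp, Amber Burt, Daniel Seung Kim, Shefali S Verma, Anastasia M. Lucas, Yuki Bradford, Dana C. Crawford, Sebastian M. Armasu, John A. Heit, M. Geoffrey Hayes, Helena Kuivaniemi, Marylyn D Ritchie, Gail P. Jarvik, Mariza de Andrade |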
---|---|
Format: | Article |
Language: | English |
Published: | Frontiers Media S.A. 2014-11-01 |
Series: | Frontiers in Genetics |
Subjects: | Principal Component Analysis, ancestry, Biobank, genetic association study, loadings |
Online Access: | http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00352/full |
_version_ | 1819177121044496384 |
---|---|
author | David Russell Crosslin, Gerard Tromp, Amber Burt, Daniel Seung Kim, Shefali S Verma, Anastasia M. Lucas, Yuki Bradford, Dana C. Crawford, Sebastian M. Armasu, John A. Heit, M. Geoffrey Hayes, Helena Kuivaniemi, Marylyn D Ritchie, Gail P. Jarvik, Mariza de Andrade |
author_sort | David Russell Crosslin |
collection | DOAJ |
description | Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification is a common practice. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of a multi-ethnic cohort with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA by deriving components from subject loadings calculated from a reference sample. This alternative approach to generating principal components controlled for site and platform bias, in addition to ancestry differences, and has the advantage of requiring fewer covariates and degrees of freedom. Keywords: principal component analysis, ancestry, biobank, loadings, genetic association study |
first_indexed | 2024-12-22T21:21:37Z |
format | Article |
id | doaj.art-aa41694d4ac745dab6bceb338a865ca0 |
institution | Directory Open Access Journal |
issn | 1664-8021 |
language | English |
last_indexed | 2024-12-22T21:21:37Z |
publishDate | 2014-11-01 |
publisher | Frontiers Media S.A. |
record_format | Article |
series | Frontiers in Genetics |
spelling | doaj.art-aa41694d4ac745dab6bceb338a865ca0; 2022-12-21T18:12:12Z; eng; Frontiers Media S.A.; Frontiers in Genetics; 1664-8021; 2014-11-01; 5; 10.3389/fgene.2014.00352; 107770; David Russell Crosslin (University of Washington); Gerard Tromp (Geisinger Health System); Amber Burt (University of Washington); Daniel Seung Kim (University of Washington); Shefali S Verma (Pennsylvania State University); Anastasia M. Lucas (Pennsylvania State University); Yuki Bradford (Pennsylvania State University); Dana C. Crawford (Vanderbilt University); Sebastian M. Armasu (Mayo Clinic); John A. Heit (Mayo Clinic); M. Geoffrey Hayes (Northwestern University); Helena Kuivaniemi (Geisinger Health System); Marylyn D Ritchie (Pennsylvania State University); Gail P. Jarvik (University of Washington); Mariza de Andrade (Mayo Clinic) |
title | Controlling for population structure and genotyping platform bias in the eMERGE multi-institutional biobank linked to Electronic Health Records |
topic | Principal Component Analysis ancestry Biobank genetic association study loadings |
url | http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00352/full |
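
The description in this record mentions deriving principal components from loadings calculated on a reference sample, rather than from the combined study data. A minimal sketch of that general idea is given below. It is not the authors' implementation: the function names, the simulated genotypes, and the choice to standardize each variant by the reference mean and standard deviation are all assumptions made for illustration.

```python
import numpy as np


def fit_reference_loadings(ref_genotypes, n_components=2):
    """Compute variant loadings from a reference panel (hypothetical helper).

    ref_genotypes: (n_ref_subjects, n_variants) array of 0/1/2 allele counts.
    Returns the per-variant means and scales used for standardization, and a
    (n_variants, n_components) loading matrix.
    """
    means = ref_genotypes.mean(axis=0)
    scales = ref_genotypes.std(axis=0)
    scales[scales == 0] = 1.0  # guard against monomorphic variants
    z = (ref_genotypes - means) / scales
    # Rows of vt are the principal axes (variant loadings) of the reference.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return means, scales, vt[:n_components].T


def project(genotypes, means, scales, loadings):
    """Score subjects on the reference axes using the reference scaling."""
    return ((genotypes - means) / scales) @ loadings


# Simulated data (assumption): two reference populations that differ in
# allele frequency, plus study subjects drawn from the first population.
rng = np.random.default_rng(0)
n_variants = 200
freq_a = rng.uniform(0.1, 0.9, n_variants)
freq_b = rng.uniform(0.1, 0.9, n_variants)
reference = np.vstack([
    rng.binomial(2, freq_a, size=(50, n_variants)),
    rng.binomial(2, freq_b, size=(50, n_variants)),
])
study = rng.binomial(2, freq_a, size=(10, n_variants))

means, scales, loadings = fit_reference_loadings(reference, n_components=2)
ref_scores = project(reference, means, scales, loadings)
study_scores = project(study, means, scales, loadings)
```

Because the loadings are fixed by the reference panel, study subjects from any site or platform are scored on the same axes, and `study_scores` can be used as covariates in an association model in place of components computed from the pooled study data.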