Theoretical formulation of principal components analysis to detect and correct for population stratification.

The Eigenstrat method, based on principal components analysis (PCA), is commonly used both to quantify population relationships in population genetics and to correct for population stratification in genome-wide association studies. However, it can be difficult to make appropriate inference about pop...

Bibliographic Details
Main Authors: Jianzhong Ma, Christopher I Amos
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2010-09-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC2941459?pdf=render
_version_ 1811322566787203072
author Jianzhong Ma
Christopher I Amos
collection DOAJ
description The Eigenstrat method, based on principal components analysis (PCA), is commonly used both to quantify population relationships in population genetics and to correct for population stratification in genome-wide association studies. However, it can be difficult to make appropriate inference about population relationships from the principal component (PC) scatter plot. Here, to better understand the working mechanism of the Eigenstrat method, we consider its theoretical or "population" formulation. The eigen-equation for samples from an arbitrary number (K) of populations is reduced to that of a matrix of dimension K, the elements of which are determined by the variance-covariance matrix for the random vector of the allele frequencies. Solving the reduced eigen-equation is numerically trivial and yields eigenvectors that are the axes of variation required for differentiating the populations. Using the reduced eigen-equation, we investigate the within-population fluctuations around the axes of variation on the PC scatter plot for simulated datasets. Specifically, we show that there exists an asymptotically stable pattern of the PC plot for large sample size. Our results provide theoretical guidance for interpreting the pattern of PC plot in terms of population relationships. For applications in genetic association tests, we demonstrate that, as a method of correcting for population stratification, regressing out the theoretical PCs corresponding to the axes of variation is equivalent to simply removing the population mean of allele counts and works as well as or better than the Eigenstrat method.
first_indexed 2024-04-13T13:38:32Z
format Article
id doaj.art-df6b567121d24167b5805f946c8eff86
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-04-13T13:38:32Z
publishDate 2010-09-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-df6b567121d24167b5805f946c8eff86; 2022-12-22T02:44:44Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2010-09-01; 5; 9; 10.1371/journal.pone.0012510; Theoretical formulation of principal components analysis to detect and correct for population stratification.; Jianzhong Ma; Christopher I Amos; http://europepmc.org/articles/PMC2941459?pdf=render
title Theoretical formulation of principal components analysis to detect and correct for population stratification.
url http://europepmc.org/articles/PMC2941459?pdf=render
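
Illustrative note on the description field: the abstract states that regressing out the theoretical PCs is equivalent to removing the population mean of allele counts. The sketch below checks the underlying linear-algebra fact on simulated data. It is not the authors' code. It assumes that the theoretical PCs are constant within each population, so that together with an intercept they span the same space as the population indicator vectors. The simulation model, the function names, and the sample sizes are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_genotypes(n_per_pop, n_snps, rng):
    """Simulate 0/1/2 allele counts for several populations (toy model)."""
    n_pops = len(n_per_pop)
    # Assumption: allele frequencies are drawn independently per population and SNP.
    freqs = rng.uniform(0.1, 0.9, size=(n_pops, n_snps))
    labels = np.repeat(np.arange(n_pops), n_per_pop)
    # Each individual gets Binomial(2, p) counts using its population's frequencies.
    genotypes = rng.binomial(2, freqs[labels]).astype(float)
    return genotypes, labels


def remove_population_means(genotypes, labels):
    """Subtract, SNP by SNP, the mean allele count of each individual's population."""
    adjusted = genotypes.copy()
    for pop in np.unique(labels):
        members = labels == pop
        adjusted[members] -= genotypes[members].mean(axis=0)
    return adjusted


def regress_out_indicators(genotypes, labels):
    """Least-squares residuals after regressing every SNP on population indicators."""
    # One 0/1 column per population; stands in for intercept + within-population-constant PCs.
    design = (labels[:, None] == np.unique(labels)[None, :]).astype(float)
    coef, *_ = np.linalg.lstsq(design, genotypes, rcond=None)
    return genotypes - design @ coef


rng = np.random.default_rng(0)
G, pop = simulate_genotypes([40, 60, 50], n_snps=200, rng=rng)

by_means = remove_population_means(G, pop)
by_regression = regress_out_indicators(G, pop)

# The two corrections agree up to floating-point error.
same = np.allclose(by_means, by_regression)
print(same)
```

Empirical PCs estimated from a finite sample fluctuate within populations, so regressing them out would only approximate this result; the exact agreement shown here relies on the stated assumption about the theoretical PCs.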