On rare variants in principal component analysis of population stratification

Abstract Background: Population stratification is a known confounder of genome-wide association studies, as it can lead to false-positive results. The principal component analysis (PCA) method is widely applied in the analysis of population structure with common variants. However, it is still unclear abo...

Bibliographic Details
Main Authors: Shengqing Ma, Gang Shi
Format: Article
Language: English
Published: BMC 2020-03-01
Series: BMC Genetics
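Subjects: Rare variant; Population stratification; Principal component analysis; Single nucleotide polymorphism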
Online Access: http://link.springer.com/article/10.1186/s12863-020-0833-x
author Shengqing Ma
Gang Shi
collection DOAJ
description Abstract Background: Population stratification is a known confounder of genome-wide association studies, as it can lead to false-positive results. The principal component analysis (PCA) method is widely applied in the analysis of population structure with common variants. However, it remains unclear how well the analysis performs when rare variants are used. Results: We derive a mathematical expectation of the genetic relationship matrix. The variance and covariance elements of the expected matrix depend explicitly on the allele frequencies of the genetic markers used in the PCA. We show that inter-population variance is contained solely in K principal components (PCs), and mostly in the largest K-1 PCs, where K is the number of populations in the samples. We propose FPC, the ratio of the inter-population variance to the intra-population variance in the K population-informative PCs, and d², the sum of squared distances among populations, as measures of population divergence. We show analytically that when allele frequencies become small, the ratio FPC falls, the population distance d² decreases, and the portion of variance explained by the K PCs diminishes. The results are validated in an analysis of the 1000 Genomes Project data. The ratio FPC is 93.85, the population distance d² is 444.38, and the variance explained by the largest five PCs is 17.09% when using common variants with allele frequencies between 0.4 and 0.5. With rare variants of frequencies between 0.0001 and 0.01, however, the ratio, distance and percentage decrease to 1.83, 17.83 and 0.74%, respectively. Conclusions: PCA of population stratification performs worse with rare variants than with common ones. It is necessary to restrict the selection to common variants when analyzing population stratification with sequencing data.
first_indexed 2024-12-11T13:12:27Z
format Article
id doaj.art-fbeef469ed9647c1a8486970db74ad70
institution Directory Open Access Journal
issn 1471-2156
language English
last_indexed 2024-12-11T13:12:27Z
publishDate 2020-03-01
publisher BMC
record_format Article
series BMC Genetics
spelling doaj.art-fbeef469ed9647c1a8486970db74ad70; 2022-12-22T01:06:09Z; eng; BMC; BMC Genetics; 1471-2156; 2020-03-01; 211111; 10.1186/s12863-020-0833-x; On rare variants in principal component analysis of population stratification; Shengqing Ma (0), Gang Shi (1); State Key Laboratory of Integrated Services Networks, Xidian University (both authors); http://link.springer.com/article/10.1186/s12863-020-0833-x; Rare variant; Population stratification; Principal component analysis; Single nucleotide polymorphism
title On rare variants in principal component analysis of population stratification
topic Rare variant
Population stratification
Principal component analysis
Single nucleotide polymorphism
url http://link.springer.com/article/10.1186/s12863-020-0833-x