Limitations of principal components in quantitative genetic association models for human studies

Principal Component Analysis (PCA) and the Linear Mixed-effects Model (LMM), sometimes in combination, are the most common genetic association models. Previous PCA-LMM comparisons give mixed results, unclear guidance, and have several limitations, including not varying the number of principal compon...


Bibliographic Details
Main Authors: Yiqi Yao, Alejandro Ochoa
Format: Article
Language: English
Published: eLife Sciences Publications Ltd 2023-05-01
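Series: eLife
Subjects: genetic association, statistical genetics, population structure, cryptic relatedness, complex quantitative traits, multiethnic human data and simulations
Online Access: https://elifesciences.org/articles/79238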
author Yiqi Yao
Alejandro Ochoa
collection DOAJ
description Principal Component Analysis (PCA) and the Linear Mixed-effects Model (LMM), sometimes in combination, are the most common genetic association models. Previous PCA-LMM comparisons give mixed results and unclear guidance, and have several limitations, including not varying the number of principal components (PCs), simulating simple population structures, and inconsistent use of real data and power evaluations. We evaluate PCA and LMM, both with varying numbers of PCs, in realistic genotype and complex trait simulations including admixed families, subpopulation trees, and real multiethnic human datasets with simulated traits. We find that LMM without PCs usually performs best, with the largest effects in family simulations and real human datasets and traits without environment effects. Poor PCA performance on human datasets is driven by large numbers of distant relatives more than by the smaller number of closer relatives. While PCA was known to fail on family data, we report strong effects of family relatedness in genetically diverse human datasets, which are not avoided by pruning close relatives. Environment effects driven by geography and ethnicity are better modeled with an LMM that includes those labels instead of PCs. This work better characterizes the severe limitations of PCA compared to LMM in modeling the complex relatedness structures of multiethnic human data for association studies.
first_indexed 2024-03-13T08:02:28Z
format Article
id doaj.art-b08f660cef1740ae982d252388aeef7e
institution Directory Open Access Journal
issn 2050-084X
language English
last_indexed 2024-03-13T08:02:28Z
publishDate 2023-05-01
publisher eLife Sciences Publications Ltd
record_format Article
series eLife
spelling doaj.art-b08f660cef1740ae982d252388aeef7e | 2023-06-01T13:32:40Z | eng | eLife Sciences Publications Ltd | eLife | 2050-084X | 2023-05-01 | 12 | 10.7554/eLife.79238 | Limitations of principal components in quantitative genetic association models for human studies | Yiqi Yao [0] | Alejandro Ochoa [1] | https://orcid.org/0000-0003-4928-3403 | Department of Biostatistics and Bioinformatics, Duke University, Durham, United States | Department of Biostatistics and Bioinformatics, Duke University, Durham, United States; Duke Center for Statistical Genetics and Genomics, Duke University, Durham, United States | https://elifesciences.org/articles/79238 | genetic association | statistical genetics | population structure | cryptic relatedness | complex quantitative traits | multiethnic human data and simulations
title Limitations of principal components in quantitative genetic association models for human studies
topic genetic association
statistical genetics
population structure
cryptic relatedness
complex quantitative traits
multiethnic human data and simulations
url https://elifesciences.org/articles/79238