Penalized regression and model selection methods for polygenic scores on summary statistics.

Bibliographic Details
Main Authors: Jack Pattee, Wei Pan
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2020-10-01
Series: PLoS Computational Biology
Online Access: https://doi.org/10.1371/journal.pcbi.1008271
collection DOAJ
description Polygenic scores quantify the genetic risk associated with a given phenotype and are widely used to predict the risk of complex diseases. There has been recent interest in developing methods to construct polygenic risk scores from summary statistic data. We propose a method to construct polygenic risk scores via penalized regression using summary statistic data and publicly available reference data. Our method is similar to the existing method LassoSum and extends its framework to the Truncated Lasso Penalty (TLP) and the elastic net. We show via simulation and real data application that the TLP improves predictive accuracy compared to the LASSO while imposing additional sparsity where appropriate. To facilitate model selection in the absence of validation data, we propose methods for estimating the model fitting criteria AIC and BIC. These methods approximate the AIC and BIC when a polygenic risk score has been estimated on summary statistic data and no validation data are available. Additionally, we propose the so-called quasi-correlation metric, which quantifies the predictive accuracy of a polygenic risk score applied to out-of-sample data for which only summary statistic information is available. Together, these methods facilitate estimation and model selection of polygenic risk scores on summary statistic data, and the application of these scores to out-of-sample data for which only summary statistics are available. We demonstrate the utility of these methods by applying them to GWA studies of lipids, height, and lung cancer.
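The description above names penalized regression on summary statistics and the Truncated Lasso Penalty without showing either. The following is a minimal illustrative sketch, not the authors' implementation: it assumes a LassoSum-style objective b'Rb - 2r'b + 2*lam*||b||_1, where r holds SNP-phenotype correlations derived from summary statistics and R is an LD (SNP-SNP correlation) matrix estimated from reference data. The function names `summary_stat_lasso` and `tlp_penalty` are hypothetical. The sketch fits only the lasso case and merely evaluates the TLP; fitting a TLP-penalized model needs an additional procedure that is not shown here.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator used by lasso coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def summary_stat_lasso(r, R, lam, n_iter=200, tol=1e-10):
    """Minimise b'Rb - 2 r'b + 2*lam*||b||_1 by coordinate descent.

    r: SNP-phenotype correlations (assumed to come from summary statistics)
    R: SNP-SNP correlation (LD) matrix (assumed to come from a reference panel)
    """
    r = np.asarray(r, dtype=float)
    R = np.asarray(R, dtype=float)
    beta = np.zeros_like(r)
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(r.size):
            # Correlation left for SNP j after removing all other SNPs' effects.
            partial = r[j] - R[j] @ beta + R[j, j] * beta[j]
            beta[j] = soft_threshold(partial, lam) / R[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

def tlp_penalty(beta, tau):
    """Truncated lasso penalty: sum over j of min(|beta_j| / tau, 1)."""
    return float(np.sum(np.minimum(np.abs(beta) / tau, 1.0)))
```

With an identity LD matrix the lasso solution reduces to soft-thresholding r, which makes a convenient sanity check. The TLP grows like the lasso for coefficients below tau and is flat above it, so large effects are not shrunk further.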
id doaj.art-dcfad8ea8fb24fba8fd05b02399ae9f2
institution Directory of Open Access Journals
issn 1553-734X
1553-7358