Predicting gene expression using DNA methylation in three human populations

Background DNA methylation, an important epigenetic mark, is well known for its regulatory role in gene expression, especially the negative correlation in the promoter region. However, its correlation with gene expression across genome at human population level has not been well studied. In particul...

Bibliographic Details
Main Authors: Huan Zhong, Soyeon Kim, Degui Zhi, Xiangqin Cui
Format: Article
Language: English
Published: PeerJ Inc. 2019-05-01
Series: PeerJ
Subjects: DNA methylation; Methylation microarray; Transcriptome; LASSO
Online Access: https://peerj.com/articles/6757.pdf
collection DOAJ
description Background: DNA methylation, an important epigenetic mark, is well known for its regulatory role in gene expression, especially its negative correlation with expression in the promoter region. However, its correlation with gene expression across the genome at the human population level has not been well studied. In particular, it is unclear whether the genome-wide DNA methylation profile of an individual can predict her or his gene expression profile. Previous studies were mostly limited to association analyses between single CpG site methylation and gene expression. It is not known whether DNA methylation of a gene has enough prediction power to serve as a surrogate for gene expression in existing human study cohorts that have DNA samples but no RNA samples.
Results: We examined DNA methylation in the gene region for predicting gene expression across individuals in non-cancer tissues of three human population datasets: adipose tissue from the Multiple Tissue Human Expression Resource (MuTHER) project, peripheral blood mononuclear cells (PBMC) from asthma and normal control study participants, and lymphoblastoid cell lines (LCL) from healthy individuals. Three prediction models were investigated: single linear regression, multiple linear regression, and least absolute shrinkage and selection operator (LASSO) penalized regression. Our results showed that LASSO regression performs best among these methods. However, the prediction power is generally low and varies across datasets. Only 30 and 42 genes were found to have a cross-validation R² greater than 0.3 in the PBMC and adipose datasets, respectively. A substantially larger number of genes (258) was identified in the LCL dataset, which was generated from a more homogeneous cell line sample source. We also showed that prediction power is better when no CpG probes are excluded because of cross-hybridization or SNP effects.
Conclusion: In our analyses of three populations, DNA methylation of CpG sites in the gene region has limited prediction power for gene expression across individuals with linear regression models. The prediction power potentially varies depending on tissue, cell type, and data source. In our analyses, combining LASSO regression with all probes on the methylation array, without excluding any, provides the best prediction of gene expression.
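The per-gene evaluation described in the abstract (fit a LASSO model on the CpG probes of one gene, then report a cross-validation R²) can be illustrated with a small sketch. This is not the authors' code: the data are synthetic, the function names (`fit_lasso`, `cross_validated_r2`) are hypothetical, the penalty `lam` is fixed by hand rather than tuned, and the solver is a plain coordinate-descent loop written with the Python standard library only. A real analysis would more likely use an established package such as scikit-learn or glmnet.

```python
import random


def soft_threshold(value, lam):
    """Shrink value toward zero by lam; return 0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def fit_lasso(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]  # residual while all coefficients are 0
    beta = [0.0] * p
    col_ss = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # constant probe carries no information
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n
            rho += col_ss[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_ss[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_bj
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


def predict(intercept, beta, row):
    return intercept + sum(b * x for b, x in zip(beta, row))


def cross_validated_r2(X, y, lam, k=5):
    """R^2 computed from out-of-fold predictions of a k-fold split."""
    n = len(y)
    preds = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        b0, beta = fit_lasso([X[i] for i in train], [y[i] for i in train], lam)
        for i in test:
            preds[i] = predict(b0, beta, X[i])
    y_mean = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic example (assumption): 60 individuals, 20 CpG probes for one gene,
# with expression driven by probe 0 (negatively) and probe 3 (positively).
random.seed(0)
n_samples, n_probes = 60, 20
methylation = [[random.random() for _ in range(n_probes)]
               for _ in range(n_samples)]
expression = [2.0 - 3.0 * row[0] + 1.5 * row[3] + random.gauss(0.0, 0.2)
              for row in methylation]

r2 = cross_validated_r2(methylation, expression, lam=0.02)
_, beta_full = fit_lasso(methylation, expression, lam=0.02)
print("cross-validation R^2:", round(r2, 3))
print("probes with nonzero coefficients:",
      [j for j, b in enumerate(beta_full) if b != 0.0])
```

In this toy setting the signal is deliberately strong, so the gene would pass the R² > 0.3 cutoff quoted above; the abstract reports that most real genes do not.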
first_indexed 2024-03-09T06:46:35Z
id doaj.art-ac948df9eca94ec7b59f2bce6cbf8d4a
institution Directory of Open Access Journals
issn 2167-8359
last_indexed 2024-03-09T06:46:35Z
spelling doaj.art-ac948df9eca94ec7b59f2bce6cbf8d4a
Record date: 2023-12-03T10:35:04Z
Journal: PeerJ (PeerJ Inc.), ISSN 2167-8359, volume 7, article e6757, published 2019-05-01
DOI: 10.7717/peerj.6757
Author affiliations:
Huan Zhong: Department of Biology, Hong Kong Baptist University, Hong Kong, China
Soyeon Kim: School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
Degui Zhi: School of Biomedical Informatics, University of Texas Health Center at Houston, Houston, TX, United States of America
Xiangqin Cui: Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States of America
title Predicting gene expression using DNA methylation in three human populations
topic DNA methylation
Methylation microarray
Transcriptome
LASSO