Assisted clustering of gene expression data using regulatory data from partially overlapping sets of individuals


Bibliographic Details
Main Authors: Wenqing Jiang, Roby Joehanes, Daniel Levy, George T O’Connor, Josée Dupuis
Format: Article
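Language: English
Published: BMC 2022-12-01
Series: BMC Genomics
Subjects: Multi-omics data integration, Gene expression, Clustering, DNA methylation, Genotype, Framingham Heart Study
Online Access: https://doi.org/10.1186/s12864-022-09026-1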
author Wenqing Jiang
Roby Joehanes
Daniel Levy
George T O’Connor
Josée Dupuis
collection DOAJ
description Abstract
Background: As omics measurements profiled on different molecular layers are interconnected, integrative approaches that incorporate the regulatory effect from multi-level omics data are needed. When the multi-level omics data are from the same individuals, gene expression (GE) clusters can be identified using information from regulators such as genetic variants and DNA methylation. When the multi-level omics data are from different individuals, the choice of integration approaches is limited.
Methods: We developed an approach to improve GE clustering from microarray data by integrating regulatory data from different but partially overlapping sets of individuals. We achieve this by (1) decomposing gene expression into a regulated component and a component that is not regulated by the measured factors, and (2) optimizing the clustering goodness-of-fit objective function. We do not require that every omics measurement be available on all individuals. A certain amount of overlap between the individuals with GE data and those with regulatory data is adequate for modeling the regulation, and thus for improving GE clustering.
Results: A simulation study shows that the performance of the proposed approach depends on the strength of the GE-regulator relationship, the degree of missingness, data dimensionality, sample size, and the number of clusters. Across the simulation settings, the proposed method shows competitive accuracy compared with the alternative K-means clustering method, especially when the clustering structure is due mostly to the regulated component rather than the unregulated component. We further validate the approach with an application to 8,902 Framingham Heart Study participants with data on up to 17,873 genes and regulatory information from DNA methylation and genotype measured on different but partially overlapping sets of participants. We identify clustering structures of genes associated with pulmonary function while incorporating the predicted regulation effect from the measured regulators. We further investigate the over-representation of these GE clusters in pathways of other diseases that may be related to lung function and respiratory health.
Conclusion: We propose a novel approach for clustering GE with the assistance of regulatory data that allows different but partially overlapping sets of individuals to be included in different omics data.
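The two steps named in the Methods can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the regulated component is estimated here by ridge-stabilised least squares on the individuals shared by both data sets, and a plain K-means on the genes' regulated profiles stands in for the paper's clustering objective. The function names (`decompose_expression`, `kmeans`), the `ridge` parameter, and the simulated data are all hypothetical.

```python
import numpy as np


def decompose_expression(ge, reg, ge_ids, reg_ids, ridge=1e-6):
    """Split expression into a regulator-predicted part and a residual.

    ge  : (n_ge, n_genes) expression matrix, rows labelled by ge_ids
    reg : (n_reg, n_regulators) regulator matrix, rows labelled by reg_ids
    Only individuals present in BOTH id lists are used for the fit.
    """
    shared = sorted(set(ge_ids) & set(reg_ids))
    ge_row = {s: i for i, s in enumerate(ge_ids)}
    reg_row = {s: i for i, s in enumerate(reg_ids)}
    y = ge[[ge_row[s] for s in shared]]
    x = reg[[reg_row[s] for s in shared]]
    x1 = np.column_stack([np.ones(len(shared)), x])  # add an intercept column
    gram = x1.T @ x1 + ridge * np.eye(x1.shape[1])   # small ridge for stability
    coef = np.linalg.solve(gram, x1.T @ y)
    regulated = x1 @ coef                            # part explained by regulators
    return coef, regulated, y - regulated            # residual = unregulated part


def kmeans(points, k, n_iter=50):
    """Plain Lloyd K-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    labels = None
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


# Simulated example (hypothetical data): 100 individuals, 2 regulators, 6 genes.
rng = np.random.default_rng(0)
reg_all = rng.normal(size=(100, 2))
effects = np.array([[2.0, 2.0, 2.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0, 2.0, 2.0, 2.0]])
ge_all = reg_all @ effects + 0.5 * rng.normal(size=(100, 6))

reg_ids = list(range(0, 60))    # regulators measured on individuals 0..59
ge_ids = list(range(20, 100))   # expression measured on individuals 20..99
coef, regulated, residual = decompose_expression(
    ge_all[20:100], reg_all[0:60], ge_ids, reg_ids
)

# Cluster genes (columns) by their regulated profiles across shared individuals.
gene_labels = kmeans(regulated.T, k=2)
print(gene_labels)
```

Only the 40 individuals with both data types enter the regression, which mirrors the partial-overlap setting described above; clustering on `regulated.T` rather than on raw expression is the design choice that lets the regulatory data assist the grouping of genes.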
first_indexed 2024-04-12T03:04:44Z
format Article
id doaj.art-48c44e0f70e44f3dbc07b8dab7d1e75b
institution Directory Open Access Journal
issn 1471-2164
language English
last_indexed 2024-04-12T03:04:44Z
publishDate 2022-12-01
publisher BMC
record_format Article
series BMC Genomics
spelling doaj.art-48c44e0f70e44f3dbc07b8dab7d1e75b 2022-12-22T03:50:32Z eng BMC BMC Genomics 1471-2164 2022-12-01 23 1 1 19 10.1186/s12864-022-09026-1
Assisted clustering of gene expression data using regulatory data from partially overlapping sets of individuals
Wenqing Jiang (Department of Biostatistics, Boston University School of Public Health)
Roby Joehanes (National Heart, Lung, and Blood Institute’s Framingham Heart Study)
Daniel Levy (National Heart, Lung, and Blood Institute’s Framingham Heart Study)
George T O’Connor (Department of Medicine, Pulmonary Center, Boston University)
Josée Dupuis (Department of Biostatistics, Boston University School of Public Health)
https://doi.org/10.1186/s12864-022-09026-1
Multi-omics data integration
Gene expression
Clustering
DNA methylation
Genotype
Framingham Heart Study
title Assisted clustering of gene expression data using regulatory data from partially overlapping sets of individuals
topic Multi-omics data integration
Gene expression
Clustering
DNA methylation
Genotype
Framingham Heart Study
url https://doi.org/10.1186/s12864-022-09026-1