A non-negative spike-and-slab lasso generalized linear stacking prediction modeling method for high-dimensional omics data

Abstract Background: High-dimensional omics data are increasingly utilized in clinical and public health research for disease risk prediction. Many sparse methods have been proposed that use prior knowledge, e.g., biological group structure information, to guide the model-building process....


Bibliographic Details
Main Authors: Junjie Shen, Shuo Wang, Yongfei Dong, Hao Sun, Xichao Wang, Zaixiang Tang
Format: Article
Language: English
Published: BMC 2024-03-01
Series: BMC Bioinformatics
Subjects: Stacking Bayesian method, Non-negative spike-and-slab prior, Omics segmentation
Online Access: https://doi.org/10.1186/s12859-024-05741-6
author Junjie Shen
Shuo Wang
Yongfei Dong
Hao Sun
Xichao Wang
Zaixiang Tang
author_sort Junjie Shen
collection DOAJ
description Abstract Background: High-dimensional omics data are increasingly utilized in clinical and public health research for disease risk prediction. Many sparse methods have been proposed that use prior knowledge, e.g., biological group structure information, to guide the model-building process. However, these methods are still based on a single model, often leading to overconfident inferences and inferior generalization. Results: We proposed a novel stacking strategy based on a non-negative spike-and-slab Lasso (nsslasso) generalized linear model (GLM) for disease risk prediction in the context of high-dimensional omics data. Briefly, we used prior biological knowledge to segment omics data into a set of sub-datasets, one per group. Each sub-model was trained separately on the features from its group via a proper base learner. Then, the predictions of the sub-models were ensembled by a super learner using the nsslasso GLM. The proposed method was compared to several competitors, such as the Lasso, grlasso, and gsslasso, using simulated data and two open-access breast cancer datasets. The proposed method showed robustly superior prediction performance to the optimal single-model method in high-noise simulated data and real-world data. Furthermore, compared to the traditional stacking method, the proposed nsslasso stacking method can efficiently handle redundant sub-models and identify important sub-models. Conclusions: The proposed nsslasso method demonstrated favorable predictive accuracy, stability, and biological interpretability. The proposed method can also be used to detect new biomarkers and key group structures.
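The super-learner step described in the abstract can be illustrated with a minimal sketch. It is not the authors' nsslasso GLM: as a simplified stand-in, it fits a non-negative lasso with squared-error loss by proximal gradient descent. The function name `fit_nonneg_stacking`, the penalty value, and the synthetic data are all assumptions made for illustration; in a real analysis each column of `preds` would hold the predictions of one group-wise sub-model.

```python
import random

def fit_nonneg_stacking(preds, y, lam=0.01, n_iter=2000):
    """Hypothetical helper: fit intercept b0 and weights w >= 0 minimizing
    (1/(2n)) * sum_i (y_i - b0 - sum_j w_j * preds[i][j])**2 + lam * sum_j w_j
    by proximal gradient descent (a stand-in for the nsslasso super learner)."""
    n, k = len(preds), len(preds[0])
    # Step size 1/trace of the scaled Gram matrix of [1, preds]; the trace
    # bounds the largest eigenvalue, so every gradient step is stable.
    trace = sum(1.0 + sum(p * p for p in row) for row in preds) / n
    step = 1.0 / trace
    b0, w = sum(y) / n, [0.0] * k
    for _ in range(n_iter):
        # Residuals of the current ensemble prediction.
        resid = [b0 + sum(wj * p for wj, p in zip(w, row)) - yi
                 for row, yi in zip(preds, y)]
        grad_b0 = sum(resid) / n
        grad_w = [sum(r * row[j] for r, row in zip(resid, preds)) / n
                  for j in range(k)]
        b0 -= step * grad_b0
        # Gradient step, L1 shrinkage, then projection onto w >= 0.
        w = [max(0.0, wj - step * (g + lam)) for wj, g in zip(w, grad_w)]
    return b0, w

# Synthetic example (assumed data): sub-models 0 and 1 are informative,
# sub-model 2 is redundant noise.
random.seed(0)
n = 200
preds = [[random.random() for _ in range(3)] for _ in range(n)]
y = [0.7 * p[0] + 0.3 * p[1] + random.gauss(0.0, 0.05) for p in preds]

intercept, weights = fit_nonneg_stacking(preds, y)
print("intercept:", round(intercept, 3))
print("weights:", [round(wj, 3) for wj in weights])
```

The non-negativity constraint keeps every sub-model's contribution interpretable as a weight, and the L1 penalty can shrink the weight of a redundant sub-model to zero, which mirrors the behaviour the abstract attributes to the nsslasso stacking method.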
first_indexed 2024-04-24T19:51:53Z
format Article
id doaj.art-935b7147842b49e2aa08ed2f83118512
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-04-24T19:51:53Z
publishDate 2024-03-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-935b7147842b49e2aa08ed2f83118512 2024-03-24T12:35:30Z eng BMC BMC Bioinformatics 1471-2105 2024-03-01 251120 10.1186/s12859-024-05741-6
A non-negative spike-and-slab lasso generalized linear stacking prediction modeling method for high-dimensional omics data
Junjie Shen (0): Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University
Shuo Wang (1): Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg
Yongfei Dong (2): Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University
Hao Sun (3): Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University
Xichao Wang (4): Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University
Zaixiang Tang (5): Department of Biostatistics, School of Public Health, Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, MOE Key Laboratory of Geriatric Diseases and Immunology, Suzhou Medical College of Soochow University
https://doi.org/10.1186/s12859-024-05741-6
Stacking Bayesian method
Non-negative spike-and-slab prior
Omics segmentation
title A non-negative spike-and-slab lasso generalized linear stacking prediction modeling method for high-dimensional omics data
topic Stacking Bayesian method
Non-negative spike-and-slab prior
Omics segmentation
url https://doi.org/10.1186/s12859-024-05741-6