Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification
Effective management of chronic diseases and cancer can greatly benefit from disease-specific biomarkers that enable informative screening and timely diagnosis. IgG N-glycans found in human plasma have the potential to be minimally invasive disease-specific biomarkers for all stages of disease devel...
Main Authors: | Konstantinos Flevaris, Joseph Davies, Shoh Nakai, Frano Vučković, Gordan Lauc, Malcolm G. Dunlop, Cleo Kontoravdi |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2024-12-01 |
Series: | Computational and Structural Biotechnology Journal |
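Subjects: | Glycosylation; Cancer; Multi-objective optimization; Probability calibration; Interpretable machine learning |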
Online Access: | http://www.sciencedirect.com/science/article/pii/S2001037024000618 |
_version_ | 1827311209045557248 |
---|---|
author | Konstantinos Flevaris; Joseph Davies; Shoh Nakai; Frano Vučković; Gordan Lauc; Malcolm G. Dunlop; Cleo Kontoravdi |
author_sort | Konstantinos Flevaris |
collection | DOAJ |
description | Effective management of chronic diseases and cancer can greatly benefit from disease-specific biomarkers that enable informative screening and timely diagnosis. IgG N-glycans found in human plasma have the potential to be minimally invasive disease-specific biomarkers for all stages of disease development due to their plasticity in response to various genetic and environmental stimuli. Data analysis and machine learning (ML) approaches can assist in harnessing the potential of IgG glycomics towards biomarker discovery and the development of reliable predictive tools for disease screening. This study proposes an ML-based N-glycomic analysis framework that can be employed to build, optimise, and evaluate multiple ML pipelines to stratify patients based on disease risk in an interpretable manner. To design and test this framework, a published colorectal cancer (CRC) dataset from the Study of Colorectal Cancer in Scotland (SOCCS) cohort (1999–2006) was used. In particular, among the different pipelines tested, an XGBoost-based ML pipeline, which was tuned using multi-objective optimisation, calibrated using an inductive Venn-Abers predictor (IVAP), and evaluated via a nested cross-validation (NCV) scheme, achieved a mean area under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.771 when classifying between age- and sex-matched healthy controls and CRC patients. This performance suggests the potential of using the relative abundance of IgG N-glycans to define populations at elevated CRC risk who merit investigation or surveillance. Finally, the IgG N-glycans that highly impact CRC classification decisions were identified using a global model-agnostic interpretability technique, namely Accumulated Local Effects (ALE). We envision that open-source computational frameworks, such as the one presented herein, will be useful in supporting the translation of glycan-based biomarkers into clinical applications. |
first_indexed | 2024-04-24T20:13:10Z |
format | Article |
id | doaj.art-72332fe12d9c4affa08d4777944d1896 |
institution | Directory of Open Access Journals |
issn | 2001-0370 |
language | English |
last_indexed | 2024-04-24T20:13:10Z |
publishDate | 2024-12-01 |
publisher | Elsevier |
record_format | Article |
series | Computational and Structural Biotechnology Journal |
spelling | doaj.art-72332fe12d9c4affa08d4777944d1896; 2024-03-23T06:23:54Z; eng; Elsevier; Computational and Structural Biotechnology Journal; 2001-0370; 2024-12-01; 23; 1234; 1243; Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification; Konstantinos Flevaris (Department of Chemical Engineering, Imperial College London, London SW7 2AZ, United Kingdom, corresponding author); Joseph Davies (Department of Chemical Engineering, Imperial College London, London SW7 2AZ, United Kingdom); Shoh Nakai (Department of Chemical Engineering, Imperial College London, London SW7 2AZ, United Kingdom); Frano Vučković (Genos Glycoscience Research Laboratory, Zagreb 10000, Croatia); Gordan Lauc (Genos Glycoscience Research Laboratory, Zagreb 10000, Croatia, and Department of Biochemistry and Molecular Biology, Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia); Malcolm G. Dunlop (Colon Cancer Genetics Group, Institute of Genetics and Cancer, Cancer Research UK Scotland Centre, University of Edinburgh and Medical Research Council Human Genetics Unit, Edinburgh, United Kingdom); Cleo Kontoravdi (Department of Chemical Engineering, Imperial College London, London SW7 2AZ, United Kingdom, corresponding author); http://www.sciencedirect.com/science/article/pii/S2001037024000618; Glycosylation, Cancer, Multi-objective optimization, Probability calibration, Interpretable machine learning |
title | Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification |
topic | Glycosylation; Cancer; Multi-objective optimization; Probability calibration; Interpretable machine learning |
url | http://www.sciencedirect.com/science/article/pii/S2001037024000618 |
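
The description in this record reports a mean AUC-ROC obtained under nested cross-validation. As a rough illustration of how such a figure can be read, the sketch below computes AUC-ROC for each held-out outer fold as the probability that a randomly chosen case is scored above a randomly chosen control (the Mann-Whitney formulation), then averages over the folds. It covers only the outer-loop scoring step; the inner tuning loop, XGBoost, and IVAP calibration are not shown. The function names and the fold data are hypothetical and are not taken from the study or its code.

```python
from statistics import mean


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_outer_fold_auc(folds):
    """Average the AUC-ROC over the held-out outer folds."""
    return mean(auc_roc(labels, scores) for labels, scores in folds)


# Hypothetical held-out predictions for three outer folds (1 = CRC, 0 = control).
# These numbers are made up for illustration only.
toy_folds = [
    ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),
    ([1, 0, 1, 0], [0.8, 0.3, 0.6, 0.2]),
    ([0, 1, 0, 1], [0.7, 0.7, 0.2, 0.9]),
]
print(mean_outer_fold_auc(toy_folds))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based implementation from an established library would be the more usual choice on real data.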