Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification

Bibliographic Details
Main Authors: Konstantinos Flevaris, Joseph Davies, Shoh Nakai, Frano Vučković, Gordan Lauc, Malcolm G. Dunlop, Cleo Kontoravdi
Format: Article
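Language: English
Published: Elsevier 2024-12-01
Series: Computational and Structural Biotechnology Journal
Subjects: Glycosylation; Cancer; Multi-objective optimization; Probability calibration; Interpretable machine learning
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037024000618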
_version_ 1797248348880633856
author Konstantinos Flevaris
Joseph Davies
Shoh Nakai
Frano Vučković
Gordan Lauc
Malcolm G. Dunlop
Cleo Kontoravdi
author_facet Konstantinos Flevaris
Joseph Davies
Shoh Nakai
Frano Vučković
Gordan Lauc
Malcolm G. Dunlop
Cleo Kontoravdi
author_sort Konstantinos Flevaris
collection DOAJ
description Effective management of chronic diseases and cancer can greatly benefit from disease-specific biomarkers that enable informative screening and timely diagnosis. IgG N-glycans found in human plasma have the potential to be minimally invasive disease-specific biomarkers for all stages of disease development due to their plasticity in response to various genetic and environmental stimuli. Data analysis and machine learning (ML) approaches can assist in harnessing the potential of IgG glycomics towards biomarker discovery and the development of reliable predictive tools for disease screening. This study proposes an ML-based N-glycomic analysis framework that can be employed to build, optimise, and evaluate multiple ML pipelines to stratify patients based on disease risk in an interpretable manner. To design and test this framework, a published colorectal cancer (CRC) dataset from the Study of Colorectal Cancer in Scotland (SOCCS) cohort (1999–2006) was used. Among the pipelines tested, an XGBoost-based ML pipeline, which was tuned using multi-objective optimisation, calibrated using an inductive Venn-Abers predictor (IVAP), and evaluated via a nested cross-validation (NCV) scheme, achieved a mean area under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.771 when distinguishing age- and sex-matched healthy controls from CRC patients. This performance suggests the potential of using the relative abundance of IgG N-glycans to define populations at elevated CRC risk who merit investigation or surveillance. Finally, the IgG N-glycans that highly impact CRC classification decisions were identified using a global model-agnostic interpretability technique, namely Accumulated Local Effects (ALE). We envision that open-source computational frameworks, such as the one presented herein, will be useful in supporting the translation of glycan-based biomarkers into clinical applications.
first_indexed 2024-04-24T20:13:10Z
format Article
id doaj.art-72332fe12d9c4affa08d4777944d1896
institution Directory Open Access Journal
issn 2001-0370
language English
last_indexed 2024-04-24T20:13:10Z
publishDate 2024-12-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
spelling doaj.art-72332fe12d9c4affa08d4777944d1896
2024-03-23T06:23:54Z
eng
Elsevier
Computational and Structural Biotechnology Journal
2001-0370
2024-12-01
23:1234-1243
Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification
Konstantinos Flevaris (0)
Joseph Davies (1)
Shoh Nakai (2)
Frano Vučković (3)
Gordan Lauc (4)
Malcolm G. Dunlop (5)
Cleo Kontoravdi (6)
(0) Department of Chemical Engineering, Imperial College London, London SW7 2AZ, United Kingdom; corresponding author
(1) Department of Chemical Engineering, Imperial College London, London SW7 2AZ, United Kingdom
(2) Department of Chemical Engineering, Imperial College London, London SW7 2AZ, United Kingdom
(3) Genos Glycoscience Research Laboratory, Zagreb 10000, Croatia
(4) Genos Glycoscience Research Laboratory, Zagreb 10000, Croatia; Department of Biochemistry and Molecular Biology, Faculty of Pharmacy and Biochemistry, University of Zagreb, Zagreb, Croatia
(5) Colon Cancer Genetics Group, Institute of Genetics and Cancer, Cancer Research UK Scotland Centre, University of Edinburgh and Medical Research Council Human Genetics Unit, Edinburgh, United Kingdom
(6) Department of Chemical Engineering, Imperial College London, London SW7 2AZ, United Kingdom; corresponding author
http://www.sciencedirect.com/science/article/pii/S2001037024000618
Glycosylation
Cancer
Multi-objective optimization
Probability calibration
Interpretable machine learning
spellingShingle Konstantinos Flevaris
Joseph Davies
Shoh Nakai
Frano Vučković
Gordan Lauc
Malcolm G. Dunlop
Cleo Kontoravdi
Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification
Computational and Structural Biotechnology Journal
Glycosylation
Cancer
Multi-objective optimization
Probability calibration
Interpretable machine learning
title Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification
title_full Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification
title_fullStr Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification
title_full_unstemmed Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification
title_short Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification
title_sort machine learning framework to extract the biomarker potential of plasma igg n glycans towards disease risk stratification
topic Glycosylation
Cancer
Multi-objective optimization
Probability calibration
Interpretable machine learning
url http://www.sciencedirect.com/science/article/pii/S2001037024000618
work_keys_str_mv AT konstantinosflevaris machinelearningframeworktoextractthebiomarkerpotentialofplasmaiggnglycanstowardsdiseaseriskstratification
AT josephdavies machinelearningframeworktoextractthebiomarkerpotentialofplasmaiggnglycanstowardsdiseaseriskstratification
AT shohnakai machinelearningframeworktoextractthebiomarkerpotentialofplasmaiggnglycanstowardsdiseaseriskstratification
AT franovuckovic machinelearningframeworktoextractthebiomarkerpotentialofplasmaiggnglycanstowardsdiseaseriskstratification
AT gordanlauc machinelearningframeworktoextractthebiomarkerpotentialofplasmaiggnglycanstowardsdiseaseriskstratification
AT malcolmgdunlop machinelearningframeworktoextractthebiomarkerpotentialofplasmaiggnglycanstowardsdiseaseriskstratification
AT cleokontoravdi machinelearningframeworktoextractthebiomarkerpotentialofplasmaiggnglycanstowardsdiseaseriskstratification
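
The description above reports a mean AUC-ROC obtained through a nested cross-validation (NCV) scheme. As a rough illustration of how such a scheme is typically organised, the following is a minimal standard-library Python sketch, not the authors' implementation. It rests on several assumptions: a toy k-nearest-neighbour scorer stands in for the XGBoost model, a plain grid search over `k` stands in for the multi-objective tuning, no IVAP calibration is performed, and the data are synthetic. All function names (`auc_roc`, `stratified_folds`, `knn_score`, `evaluate`, `nested_cv`, `make_toy_data`) are hypothetical.

```python
import random
from statistics import mean


def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(indices, labels, k, seed=0):
    """Split `indices` into k folds, dealing each class out round-robin after shuffling."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted({labels[i] for i in indices}):
        members = [i for i in indices if labels[i] == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds


def knn_score(fit_X, fit_y, x, k):
    """Toy stand-in model: share of positives among the k nearest training points."""
    order = sorted(range(len(fit_X)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(fit_X[j], x)))
    return mean(fit_y[j] for j in order[:k])


def evaluate(X, y, fit_idx, eval_idx, k):
    """Fit on fit_idx (trivial for kNN) and return the AUC-ROC on eval_idx."""
    fit_X = [X[i] for i in fit_idx]
    fit_y = [y[i] for i in fit_idx]
    scores = [knn_score(fit_X, fit_y, X[i], k) for i in eval_idx]
    return auc_roc([y[i] for i in eval_idx], scores)


def nested_cv(X, y, candidate_ks, outer_k=5, inner_k=3, seed=0):
    """Outer loop estimates performance; inner loop picks the hyperparameter."""
    all_idx = list(range(len(y)))
    outer_aucs = []
    for fold_no, test_idx in enumerate(stratified_folds(all_idx, y, outer_k, seed)):
        held_out = set(test_idx)
        train_idx = [i for i in all_idx if i not in held_out]
        inner_folds = stratified_folds(train_idx, y, inner_k, seed + 1 + fold_no)

        def inner_auc(k):
            # Mean validation AUC over the inner folds, using training data only.
            aucs = []
            for val_idx in inner_folds:
                val = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val]
                aucs.append(evaluate(X, y, fit_idx, val_idx, k))
            return mean(aucs)

        best_k = max(candidate_ks, key=inner_auc)
        # The outer test fold is touched only once, after tuning is finished.
        outer_aucs.append(evaluate(X, y, train_idx, test_idx, best_k))
    return mean(outer_aucs), outer_aucs


def make_toy_data(n_per_class=30, seed=0):
    """Synthetic two-feature data: class 0 centred at 0, class 1 centred at 2."""
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in ((0, 0.0), (1, 2.0)):
        for _ in range(n_per_class):
            X.append((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)))
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    mean_auc, fold_aucs = nested_cv(X, y, candidate_ks=[1, 3, 5])
    print("outer-fold AUCs:", fold_aucs)
    print("mean AUC-ROC:", mean_auc)
```

The point of the nesting is that hyperparameters are chosen using only the outer training portion, so the outer-fold AUC values are not biased by the tuning step. In a real pipeline, any calibration step would likewise need to be fitted inside the outer training portion.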