Identifying subtypes of chronic kidney disease with machine learning: development, internal validation and prognostic validation using linked electronic health records in 350,067 individuals

Summary: Background: Although chronic kidney disease (CKD) is associated with high multimorbidity, polypharmacy, morbidity and mortality, existing classification systems (mild to severe, usually based on estimated glomerular filtration rate, proteinuria or urine albumin-creatinine ratio) and risk p...

Bibliographic Details
Main Authors: Ashkan Dashtban, Mehrdad A. Mizani, Laura Pasea, Spiros Denaxas, Richard Corbett, Jil B. Mamza, He Gao, Tamsin Morris, Harry Hemingway, Amitava Banerjee
Format: Article
Language: English
Published: Elsevier 2023-03-01
Series: EBioMedicine
Subjects: CKD subtype; Cluster analysis; Machine learning; Unsupervised clustering; Survival analysis
Online Access: http://www.sciencedirect.com/science/article/pii/S2352396423000543
collection DOAJ
description Summary:
Background: Although chronic kidney disease (CKD) is associated with high multimorbidity, polypharmacy, morbidity and mortality, existing classification systems (mild to severe, usually based on estimated glomerular filtration rate, proteinuria or urine albumin-creatinine ratio) and risk prediction models largely ignore the complexity of CKD, its risk factors and its outcomes. Improved subtype definition could improve prediction of outcomes and inform effective interventions.
Methods: We analysed individuals ≥18 years with incident and prevalent CKD (n = 350,067 and 195,422 respectively) from a population-based electronic health record resource (2006–2020; Clinical Practice Research Datalink, CPRD). We included factors (n = 264 with 2670 derived variables), e.g. demography, history, examination, blood laboratory values and medications. Using a published framework, we identified subtypes through seven unsupervised machine learning (ML) methods (K-means, Diana, HC, Fanny, PAM, Clara, Model-based) with 66 (of 2670) variables in each dataset. We evaluated subtypes for: (i) internal validity (within dataset, across methods); (ii) prognostic validity (predictive accuracy for 5-year all-cause mortality and admissions); and (iii) medications (new and existing by British National Formulary chapter).
Findings: After identifying five clusters across seven approaches, we labelled CKD subtypes: 1. Early-onset, 2. Late-onset, 3. Cancer, 4. Metabolic, and 5. Cardiometabolic. Internal validity: We trained a high-performing model (using XGBoost) that could predict disease subtypes with 95% accuracy for incident and prevalent CKD (Sensitivity: 0.81–0.98, F1 score: 0.84–0.97). Prognostic validity: 5-year all-cause mortality, hospital admissions, and incidence of new chronic diseases differed across CKD subtypes. The 5-year risks of mortality and admissions in the overall incident CKD population were highest in the cardiometabolic subtype: 43.3% (42.3–42.8%) and 29.5% (29.1–30.0%), respectively, and lowest in the early-onset subtype: 5.7% (5.5–5.9%) and 18.7% (18.4–19.1%). Medications: Across CKD subtypes, the distribution of prescription medication classes at baseline varied, with the highest medication burden in the cardiometabolic and metabolic subtypes, and a higher burden in prevalent than in incident CKD.
Interpretation: In the largest CKD study using ML to date, we identified five distinct subtypes in individuals with incident and prevalent CKD. These subtypes have relevance to the study of aetiology, therapeutics and risk prediction.
Funding: AstraZeneca UK Ltd, Health Data Research UK.
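The internal-validity figures in the summary (overall accuracy, plus sensitivity and F1 score given as ranges) can be illustrated with a short sketch. This is not the authors' code: it assumes the reported ranges are per-subtype values, and the function name `per_class_metrics` and the toy labels are hypothetical, chosen only to show how such metrics are computed from true and predicted subtype labels.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and {label: (sensitivity, f1)}.

    Hypothetical helper for illustration; not taken from the study.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1      # correctly assigned to its subtype
        else:
            fn[truth] += 1      # missed for the true subtype
            fp[pred] += 1       # wrongly assigned to the predicted subtype

    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        positives = tp[label] + fn[label]
        sensitivity = tp[label] / positives if positives else 0.0
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        metrics[label] = (sensitivity, f1)

    accuracy = sum(tp.values()) / len(y_true) if y_true else 0.0
    return accuracy, metrics


# Toy labels (invented for the example, not study data).
y_true = ["early", "early", "late", "late", "cancer", "cancer"]
y_pred = ["early", "late", "late", "late", "cancer", "cancer"]

accuracy, metrics = per_class_metrics(y_true, y_pred)
for label, (sensitivity, f1) in metrics.items():
    print(f"{label}: sensitivity={sensitivity:.2f}, F1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reporting the minimum and maximum of the per-subtype values would give ranges of the kind quoted in the summary, under the assumption stated above.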
id doaj.art-ab7bcfacdae846f29bd9f1b8be42c253
institution Directory Open Access Journal
issn 2352-3964
spelling doaj.art-ab7bcfacdae846f29bd9f1b8be42c253; 2023-02-28T04:08:54Z; eng; Elsevier; EBioMedicine; ISSN 2352-3964; 2023-03-01; volume 89, article 104489
Author affiliations:
Ashkan Dashtban: Institute of Health Informatics, University College London, London, UK
Mehrdad A. Mizani: Institute of Health Informatics, University College London, London, UK; British Heart Foundation Data Science Centre, Health Data Research UK, London, UK
Laura Pasea: Institute of Health Informatics, University College London, London, UK
Spiros Denaxas: Institute of Health Informatics, University College London, London, UK
Richard Corbett: Imperial College Healthcare NHS Trust, London, UK
Jil B. Mamza: Medical and Scientific Affairs, BioPharmaceuticals Medical, AstraZeneca, London, UK
He Gao: Medical and Scientific Affairs, BioPharmaceuticals Medical, AstraZeneca, London, UK
Tamsin Morris: Medical and Scientific Affairs, BioPharmaceuticals Medical, AstraZeneca, London, UK
Harry Hemingway: Institute of Health Informatics, University College London, London, UK; Health Data Research UK, University College London, London, UK
Amitava Banerjee: Institute of Health Informatics, University College London, London, UK; Barts Health NHS Trust, London, UK; University College London Hospitals NHS Trust, London, UK; Corresponding author. Institute of Health Informatics, University College London, 222 Euston Road, London NW1 2DA, UK.
topic CKD subtype
Cluster analysis
Machine learning
Unsupervised clustering
Survival analysis