Performance and clinical utility of a new supervised machine-learning pipeline in detecting rare ciliopathy patients based on deep phenotyping from electronic health records and semantic similarity


Bibliographic Details
Main Authors: Carole Faviez, Marc Vincent, Nicolas Garcelon, Olivia Boyer, Bertrand Knebelmann, Laurence Heidet, Sophie Saunier, Xiaoyi Chen, Anita Burgun
Format: Article
Language: English
Published: BMC 2024-02-01
Series: Orphanet Journal of Rare Diseases
Subjects:
Online Access: https://doi.org/10.1186/s13023-024-03063-7
collection DOAJ
description Abstract
Background: Rare diseases affect approximately 400 million people worldwide, and many patients suffer from delayed diagnosis. Among these diseases, NPHP1-related renal ciliopathies need to be diagnosed as early as possible, as potential treatments have recently been investigated with promising results. Our objective was to develop a supervised machine learning pipeline for detecting NPHP1 ciliopathy patients among a large number of nephrology patients using electronic health records (EHRs).
Methods and results: We designed a pipeline combining a phenotyping module that reuses unstructured EHR data, a semantic similarity module to address dependence between phenotypes, a feature selection step to deal with high dimensionality, an undersampling step to address class imbalance, and a classification step with multiple train-test splits to cope with the small number of rare cases. The pipeline was applied to 30 NPHP1 patients and 7231 controls and achieved good performance (sensitivity 86%, specificity 90%). A qualitative review of the EHRs of 40 misclassified controls showed that 25% (10 of 40) had phenotypes belonging to the ciliopathy spectrum, indicating that the system can detect patients with similar conditions.
Conclusions: Our pipeline reached very encouraging performance scores for pre-diagnosing ciliopathy patients. The identified patients could then undergo genetic testing. The same data-driven approach can be adapted to other rare diseases facing underdiagnosis challenges.
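The undersampling and multiple train-test split steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not specify the classifier, the undersampling ratio, the number of splits, or whether undersampling was applied to the training data only. Here each patient is reduced to a single hypothetical numeric score, the "classifier" is a toy midpoint threshold, and the names `split`, `undersample`, `fit_threshold`, and `evaluate`, along with all default parameters, are assumptions made for illustration. The data are synthetic; only the cohort sizes (30 and 7231) are taken from the abstract.

```python
import random
from statistics import mean


def split(items, test_fraction, rng):
    """Shuffle a copy of items and return (train, test)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


def undersample(controls, n_cases, ratio, rng):
    """Randomly keep at most `ratio` controls per case."""
    return rng.sample(controls, min(len(controls), ratio * n_cases))


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels 1 = case, 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def fit_threshold(case_scores, control_scores):
    """Toy stand-in for a classifier: midpoint between the two class means."""
    return (mean(case_scores) + mean(control_scores)) / 2


def evaluate(cases, controls, n_splits=20, ratio=3, test_fraction=0.3, seed=0):
    """Average sensitivity and specificity over repeated random splits."""
    rng = random.Random(seed)
    sens, spec = [], []
    for _ in range(n_splits):
        train_cases, test_cases = split(cases, test_fraction, rng)
        train_ctrl, test_ctrl = split(controls, test_fraction, rng)
        # Assumption: only training controls are undersampled, so the
        # test set keeps the original class imbalance.
        train_ctrl = undersample(train_ctrl, len(train_cases), ratio, rng)
        threshold = fit_threshold(train_cases, train_ctrl)
        y_true = [1] * len(test_cases) + [0] * len(test_ctrl)
        y_pred = [int(score > threshold) for score in test_cases + test_ctrl]
        se, sp = sensitivity_specificity(y_true, y_pred)
        sens.append(se)
        spec.append(sp)
    return mean(sens), mean(spec)


if __name__ == "__main__":
    # Synthetic scores only; these are not study data.
    data_rng = random.Random(42)
    cases = [data_rng.gauss(2.0, 1.0) for _ in range(30)]
    controls = [data_rng.gauss(0.0, 1.0) for _ in range(7231)]
    se, sp = evaluate(cases, controls)
    print(f"mean sensitivity={se:.2f}, mean specificity={sp:.2f}")
```

Repeating the split many times and averaging the scores is one common way to stabilise estimates when only a few dozen positive cases are available, since any single test set would contain very few of them.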
first_indexed 2024-03-07T14:41:38Z
id doaj.art-554f62835a944de6bb50e39eb453ba23
institution Directory of Open Access Journals
issn 1750-1172
last_indexed 2024-03-07T14:41:38Z
spelling doaj.art-554f62835a944de6bb50e39eb453ba23
2024-03-05T20:20:07Z
eng
BMC
Orphanet Journal of Rare Diseases
1750-1172
2024-02-01
Volume 19, issue 1, pages 1-12
10.1186/s13023-024-03063-7
Author affiliations:
Carole Faviez: Centre de Recherche des Cordeliers, Université Paris Cité, Sorbonne Université, INSERM UMR 1138
Marc Vincent: Université Paris Cité, Imagine Institute, Data Science Platform, INSERM UMR 1163
Nicolas Garcelon: Centre de Recherche des Cordeliers, Université Paris Cité, Sorbonne Université, INSERM UMR 1138
Olivia Boyer: Department of Pediatric Nephrology, APHP-Centre, Reference Center for Inherited Renal Diseases (MARHEA), Imagine Institute, Hôpital Necker-Enfants Malades, Université Paris Cité
Bertrand Knebelmann: Nephrology and Transplantation Department, MARHEA, Hôpital Necker-Enfants Malades, AP-HP, Université Paris Cité
Laurence Heidet: Department of Pediatric Nephrology, APHP-Centre, Reference Center for Inherited Renal Diseases (MARHEA), Imagine Institute, Hôpital Necker-Enfants Malades, Université Paris Cité
Sophie Saunier: Laboratory of Renal Hereditary Diseases, INSERM UMR 1163, Imagine Institute, Université Paris Cité
Xiaoyi Chen: Centre de Recherche des Cordeliers, Université Paris Cité, Sorbonne Université, INSERM UMR 1138
Anita Burgun: Centre de Recherche des Cordeliers, Université Paris Cité, Sorbonne Université, INSERM UMR 1138
topic Diagnosis support
Electronic health record
Supervised machine learning
Semantic similarity
Imbalanced dataset
Rare disease