Strategies to Address the Lack of Labeled Data for Supervised Machine Learning Training With Electronic Health Records: Case Study for the Extraction of Symptoms From Clinical Notes


Bibliographic Details
Main Authors: Marie Humbert-Droz, Pritam Mukherjee, Olivier Gevaert
Format: Article
Language: English
Published: JMIR Publications 2022-03-01
Series: JMIR Medical Informatics
Online Access: https://medinform.jmir.org/2022/3/e32903
collection DOAJ
description Background: Automated extraction of symptoms from clinical notes is a challenging task owing to the multidimensional nature of symptom description. The availability of labeled training data is extremely limited owing to the nature of the data containing protected health information. Natural language processing and machine learning to process clinical text for such a task have great potential. However, supervised machine learning requires a great amount of labeled data to train a model, which is at the origin of the main bottleneck in model development.
Objective: The aim of this study is to address the lack of labeled data by proposing 2 alternatives to manual labeling for the generation of training labels for supervised machine learning with English clinical text. We aim to demonstrate that using lower-quality labels for training leads to good classification results.
Methods: We addressed the lack of labels with 2 strategies. The first approach took advantage of the structured part of electronic health records and used diagnosis codes (International Classification of Disease–10th revision) to derive training labels. The second approach used weak supervision and data programming principles to derive training labels. We propose to apply the developed framework to the extraction of symptom information from outpatient visit progress notes of patients with cardiovascular diseases.
Results: We used >500,000 notes for training our classification model with International Classification of Disease–10th revision codes as labels and >800,000 notes for training using labels derived from weak supervision. We show that the dependence between prevalence and recall becomes flat provided a sufficiently large training set is used (>500,000 documents). We further demonstrate that using weak labels for training rather than the electronic health record codes derived from the patient encounter leads to an overall improved recall score (10% improvement, on average). Finally, the external validation of our models shows excellent predictive performance and transferability, with an overall increase of 20% in the recall score.
Conclusions: This work demonstrates the power of using a weak labeling pipeline to annotate and extract symptom mentions in clinical text, with the prospects to facilitate symptom information integration for a downstream clinical task such as clinical decision support.
first_indexed 2024-03-12T12:55:28Z
id doaj.art-41940a9941fd4927bf2b8d017c7b5035
institution Directory Open Access Journal
issn 2291-9694
last_indexed 2024-03-12T12:55:28Z
spelling doaj.art-41940a9941fd4927bf2b8d017c7b5035; 2023-08-28T21:04:16Z; eng; JMIR Publications; JMIR Medical Informatics; 2291-9694; 2022-03-01; 10(3), e32903; 10.2196/32903
Marie Humbert-Droz, https://orcid.org/0000-0001-6814-544X
Pritam Mukherjee, https://orcid.org/0000-0002-9975-9994
Olivier Gevaert, https://orcid.org/0000-0002-9965-5466