A Natural Language Processing Algorithm to Improve Completeness of ECOG Performance Status in Real-World Data

Our goal was to develop and characterize a Natural Language Processing (NLP) algorithm to extract Eastern Cooperative Oncology Group Performance Status (ECOG PS) from unstructured electronic health record (EHR) sources to enhance observational datasets. By scanning unstructured EHR-derived documents...

Bibliographic Details
Main Authors: Aaron B. Cohen, Andrej Rosic, Katherine Harrison, Madeline Richey, Sheila Nemeth, Geetu Ambwani, Rebecca Miksad, Benjamin Haaland, Chengsheng Jiang
Format: Article
Language: English
Published: MDPI AG 2023-05-01
Series: Applied Sciences
Subjects: EHR, machine learning, ECOG PS, RWD, RWE, NLP
Online Access: https://www.mdpi.com/2076-3417/13/10/6209
collection DOAJ
description Our goal was to develop and characterize a Natural Language Processing (NLP) algorithm that extracts Eastern Cooperative Oncology Group Performance Status (ECOG PS) from unstructured electronic health record (EHR) sources to enhance observational datasets. By scanning unstructured EHR-derived documents from a real-world database, the NLP algorithm assigned ECOG PS scores, anchored to the initiation of treatment lines, to patients diagnosed with one of 21 cancer types who lacked structured ECOG PS numerical scores. Manually abstracted ECOG PS scores were used as a source of truth both to develop the algorithm and to evaluate its accuracy, sensitivity, and positive predictive value (PPV). Algorithm performance was further characterized by investigating the prognostic value of composite ECOG PS scores in patients with advanced non-small cell lung cancer (aNSCLC) receiving first-line (1L) treatment. Of N = 480,825 patient-lines, structured ECOG PS scores were available for 290,343 (60.4%). After applying NLP extraction, availability increased to 73.2%. Across all cancer types, the algorithm’s overall accuracy, sensitivity, and PPV were 93% (95% CI: 92–94%), 88% (95% CI: 87–89%), and 88% (95% CI: 87–89%), respectively. In a cohort of N = 51,948 aNSCLC patients receiving 1L therapy, the algorithm improved ECOG PS completeness from 61.5% to 75.6%. Stratification by ECOG PS showed worse real-world overall survival (rwOS) for patients with worse ECOG PS scores. We developed an NLP algorithm to extract ECOG PS scores from unstructured EHR documents with high accuracy, improving data completeness for EHR-derived oncology cohorts.
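The completeness figures and evaluation metrics named in the description can be illustrated with a minimal Python sketch. The patient-line counts and the 73.2% figure come from the description; the implied post-NLP count is only an approximation derived from that rounded percentage, and the function names and the binary confusion-matrix counts are hypothetical. This record does not state how the study aggregated sensitivity and PPV across ECOG PS classes, so the metric definitions below are the standard binary ones, not the authors' method.

```python
def completeness(available: int, total: int) -> float:
    """Percentage of patient-lines that have an ECOG PS score."""
    return 100.0 * available / total


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary definitions of accuracy, sensitivity, and PPV."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),  # share of true cases that were found
        "ppv": tp / (tp + fp),          # share of positive calls that were right
    }


TOTAL_LINES = 480_825  # patient-lines, from the description
STRUCTURED = 290_343   # patient-lines with a structured score, from the description

baseline = completeness(STRUCTURED, TOTAL_LINES)

# Approximation only: back-calculated from the rounded 73.2% in the description.
implied_after = round(0.732 * TOTAL_LINES)
implied_gain = implied_after - STRUCTURED

# Hypothetical counts, chosen purely to exercise the formulas.
example = classification_metrics(tp=88, fp=12, fn=12, tn=88)

if __name__ == "__main__":
    print(f"baseline completeness: {baseline:.1f}%")
    print(f"implied patient-lines gained via NLP: about {implied_gain:,}")
    print(example)
```

Under these assumptions, the rounded percentages imply that NLP extraction supplied scores for roughly 61,600 additional patient-lines; the exact count is not given in this record.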
id doaj.art-5d2bcd0146754cc0a929b18b43f38c5c
institution Directory Open Access Journal
issn 2076-3417
spelling DOI: 10.3390/app13106209
Applied Sciences, volume 13, issue 10, article 6209
Affiliations:
Aaron B. Cohen, Andrej Rosic, Katherine Harrison, Madeline Richey, Sheila Nemeth, Geetu Ambwani, Rebecca Miksad, Chengsheng Jiang: Flatiron Health Inc., 233 Spring St., New York, NY 10013, USA
Benjamin Haaland: Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, USA
topic EHR
machine learning
ECOG PS
RWD
RWE
NLP