A Simple Free-Text-like Method for Extracting Semi-Structured Data from Electronic Health Records: Exemplified in Prediction of In-Hospital Mortality

The Epic electronic health record (EHR) is a commonly used EHR in the United States. This EHR contain large semi-structured “flowsheet” fields. Flowsheet fields lack a well-defined data dictionary and are unique to each site. We evaluated a simple free-text-like method to extract these data. As a us...

Full description

Bibliographic Details
Main Authors: Eyal Klang, Matthew A. Levin, Shelly Soffer, Alexis Zebrowski, Benjamin S. Glicksberg, Brendan G. Carr, Jolion Mcgreevy, David L. Reich, Robert Freeman
Format: Article
Language:English
Published: MDPI AG 2021-08-01
Series:Big Data and Cognitive Computing
Subjects:
Online Access:https://www.mdpi.com/2504-2289/5/3/40
Description
Summary:The Epic electronic health record (EHR) is a commonly used EHR in the United States. This EHR contain large semi-structured “flowsheet” fields. Flowsheet fields lack a well-defined data dictionary and are unique to each site. We evaluated a simple free-text-like method to extract these data. As a use case, we demonstrate this method in predicting mortality during emergency department (ED) triage. We retrieved demographic and clinical data for ED visits from the Epic EHR (1/2014–12/2018). Data included structured, semi-structured flowsheet records and free-text notes. The study outcome was in-hospital death within 48 h. Most of the data were coded using a free-text-like Bag-of-Words (BoW) approach. Two machine-learning models were trained: gradient boosting and logistic regression. Term frequency-inverse document frequency was employed in the logistic regression model (LR-tf-idf). An ensemble of LR-tf-idf and gradient boosting was evaluated. Models were trained on years 2014–2017 and tested on year 2018. Among 412,859 visits, the 48-h mortality rate was 0.2%. LR-tf-idf showed AUC 0.98 (95% CI: 0.98–0.99). Gradient boosting showed AUC 0.97 (95% CI: 0.96–0.99). An ensemble of both showed AUC 0.99 (95% CI: 0.98–0.99). In conclusion, a free-text-like approach can be useful for extracting knowledge from large amounts of complex semi-structured EHR data.
ISSN:2504-2289