Text data extraction for a prospective, research-focused data mart: implementation and validation

Abstract. Background: Translational research typically requires data abstracted from medical records as well as data collected specifically for research. Unfortunately, many data within electronic health records are represented as text that is not amenabl...


Bibliographic Details
Main Authors: Hinchcliff Monique, Just Eric, Podlusky Sofia, Varga John, Chang Rowland W, Kibbe Warren A
Format: Article
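Language: English
Published: BMC 2012-09-01
Series: BMC Medical Informatics and Decision Making
Subjects: Medical informatics; Information storage and retrieval; Information systems; Electronic health records; Automatic data processing
Online Access: http://www.biomedcentral.com/1472-6947/12/106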
author Hinchcliff Monique
Just Eric
Podlusky Sofia
Varga John
Chang Rowland W
Kibbe Warren A
collection DOAJ
description Abstract
Background: Translational research typically requires data abstracted from medical records as well as data collected specifically for research. Unfortunately, many data within electronic health records are represented as text that is not amenable to aggregation for analyses. We present a scalable open source SQL Server Integration Services package, called Regextractor, for including regular expression parsers into a classic extract, transform, and load workflow. We have used Regextractor to abstract discrete data from textual reports from a number of ‘machine generated’ sources. To validate this package, we created a pulmonary function test data mart and analyzed the quality of the data mart versus manual chart review.
Methods: Eleven variables from pulmonary function tests performed closest to the initial clinical evaluation date were studied for 100 randomly selected subjects with scleroderma. One research assistant manually reviewed, abstracted, and entered relevant data into a database. Correlation with data obtained from the automated pulmonary function test data mart within the Northwestern Medical Enterprise Data Warehouse was determined.
Results: There was a near perfect (99.5%) agreement between results generated from the Regextractor package and those obtained via manual chart abstraction. The pulmonary function test data mart has been used subsequently to monitor disease progression of patients in the Northwestern Scleroderma Registry. In addition to the pulmonary function test example presented in this manuscript, the Regextractor package has been used to create cardiac catheterization and echocardiography data marts. The Regextractor package was released as open source software in October 2009 and has been downloaded 552 times as of 6/1/2012.
Conclusions: Collaboration between clinical researchers and biomedical informatics experts enabled the development and validation of a tool (Regextractor) to parse, abstract and assemble structured data from text data contained in the electronic health record. Regextractor has been successfully used to create additional data marts in other medical domains and is available to the public.
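The abstract describes two steps: regular expression parsers pull discrete values out of machine-generated report text, and the resulting data mart is compared field by field with manual chart abstraction. The following is only a minimal Python sketch of that general idea, not the Regextractor package itself (which is an SQL Server Integration Services package). The report layout, the field names (`fvc_l`, `fev1_l`, `dlco`), the patterns, and the helper functions `extract` and `agreement` are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical patterns for a machine-generated pulmonary function report.
# The real report layout and the expressions used by Regextractor are not
# given in the abstract; these are assumptions for illustration only.
PATTERNS = {
    "fvc_l": re.compile(r"FVC\s*\(L\)\s*[:=]?\s*(\d+(?:\.\d+)?)"),
    "fev1_l": re.compile(r"FEV1\s*\(L\)\s*[:=]?\s*(\d+(?:\.\d+)?)"),
    "dlco": re.compile(r"DLCO\s*[:=]?\s*(\d+(?:\.\d+)?)"),
}


def extract(report_text):
    """Return one row of discrete values; a field is None if its pattern fails."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(report_text)
        row[name] = float(match.group(1)) if match else None
    return row


def agreement(extracted, manual):
    """Fraction of manually abstracted fields that the extracted row reproduces."""
    fields = list(manual)
    matches = sum(1 for field in fields if extracted.get(field) == manual[field])
    return matches / len(fields)


# Made-up report text and made-up manual values, not data from the study.
report = "FVC (L): 3.42\nFEV1 (L): 2.75\nDLCO: 18.9\n"
row = extract(report)
manual = {"fvc_l": 3.42, "fev1_l": 2.75, "dlco": 18.9}
print(row, agreement(row, manual))
```

In this sketch a field that cannot be parsed is stored as `None` rather than guessed, so it counts as a disagreement when compared with the manual value. Whether the published validation treated missing values this way is not stated in the abstract.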
first_indexed 2024-04-13T11:16:11Z
format Article
id doaj.art-135e9b64dc254cc8ad320acdc3074983
institution Directory Open Access Journal
issn 1472-6947
language English
last_indexed 2024-04-13T11:16:11Z
publishDate 2012-09-01
publisher BMC
record_format Article
series BMC Medical Informatics and Decision Making
spelling doaj.art-135e9b64dc254cc8ad320acdc3074983; 2022-12-22T02:48:58Z; eng; BMC; BMC Medical Informatics and Decision Making; 1472-6947; 2012-09-01; 12; 1; 106; 10.1186/1472-6947-12-106
title Text data extraction for a prospective, research-focused data mart: implementation and validation
topic Medical informatics
Information storage and retrieval
Information systems
Electronic health records
Automatic data processing
url http://www.biomedcentral.com/1472-6947/12/106