Evaluation of a semi-automated data extraction tool for public health literature-based reviews: Dextr

Bibliographic Details
Main Authors: Vickie R. Walker, Charles P. Schmitt, Mary S. Wolfe, Artur J. Nowak, Kuba Kulesza, Ashley R. Williams, Rob Shin, Jonathan Cohen, Dave Burch, Matthew D. Stout, Kelly A. Shipkowski, Andrew A. Rooney
Format: Article
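Language: English
Published: Elsevier 2022-01-01
Series: Environment International
Subjects: Automation; Text mining; Machine learning; Natural language processing; Literature review; Systematic review
Online Access: http://www.sciencedirect.com/science/article/pii/S0160412021006504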
author Vickie R. Walker
Charles P. Schmitt
Mary S. Wolfe
Artur J. Nowak
Kuba Kulesza
Ashley R. Williams
Rob Shin
Jonathan Cohen
Dave Burch
Matthew D. Stout
Kelly A. Shipkowski
Andrew A. Rooney
author_sort Vickie R. Walker
collection DOAJ
description Introduction: There has been limited development and uptake of machine-learning methods to automate data extraction for literature-based assessments. Although advanced extraction approaches have been applied to some clinical research reviews, existing methods are not well suited for addressing toxicology or environmental health questions due to unique data needs to support reviews in these fields.
Objectives: To develop and evaluate a flexible, web-based tool for semi-automated data extraction that: 1) makes data extraction predictions with user verification, 2) integrates token-level annotations, and 3) connects extracted entities to support hierarchical data extraction.
Methods: Dextr was developed with Agile software methodology using a two-team approach. The development team outlined proposed features and coded the software. The advisory team guided developers and evaluated Dextr’s performance on precision, recall, and extraction time by comparing a manual extraction workflow to a semi-automated extraction workflow using a dataset of 51 environmental health animal studies.
Results: The semi-automated workflow did not appear to affect precision rate (96.0% vs. 95.4% manual, p = 0.38), resulted in a small reduction in recall rate (91.8% vs. 97.0% manual, p < 0.01), and substantially reduced the median extraction time (436 s vs. 933 s per study manual, p < 0.01) compared to a manual workflow.
Discussion: Dextr provides similar performance to manual extraction in terms of recall and precision and greatly reduces data extraction time. Unlike other tools, Dextr provides the ability to extract complex concepts (e.g., multiple experiments with various exposures and doses within a single study), properly connect the extracted elements within a study, and effectively limit the work required by researchers to generate machine-readable, annotated exports.
The Dextr tool addresses data-extraction challenges associated with environmental health sciences literature with a simple user interface, incorporates the key capabilities of user verification and entity connecting, provides a platform for further automation developments, and has the potential to improve data extraction for literature reviews in this and other fields.
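The description reports precision and recall for each workflow without stating how they were calculated. As a rough illustration only, the sketch below uses the standard definitions of the two measures. The class name, function names, and counts are hypothetical and are not taken from the study or from Dextr.

```python
from dataclasses import dataclass


@dataclass
class ExtractionCounts:
    """Hypothetical tally of extracted items for one workflow."""
    true_positives: int   # extracted items that match the reference standard
    false_positives: int  # extracted items that do not match the reference
    false_negatives: int  # reference items the workflow failed to extract


def precision(c: ExtractionCounts) -> float:
    """Share of extracted items that are correct."""
    denom = c.true_positives + c.false_positives
    return c.true_positives / denom if denom else 0.0


def recall(c: ExtractionCounts) -> float:
    """Share of reference items that were extracted."""
    denom = c.true_positives + c.false_negatives
    return c.true_positives / denom if denom else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration; these are NOT the study's data.
    example = ExtractionCounts(true_positives=90, false_positives=10, false_negatives=30)
    print(f"precision = {precision(example):.1%}")  # prints "precision = 90.0%"
    print(f"recall    = {recall(example):.1%}")     # prints "recall    = 75.0%"
```

Under these definitions, a workflow that misses more reference items has lower recall even when nearly everything it does extract is correct, which is the pattern the Results sentence describes for the semi-automated workflow.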
first_indexed 2024-12-18T05:10:05Z
format Article
id doaj.art-42972b504dc14de7b4308a659c2da38b
institution Directory Open Access Journal
issn 0160-4120
language English
last_indexed 2024-12-18T05:10:05Z
publishDate 2022-01-01
publisher Elsevier
record_format Article
series Environment International
spelling doaj.art-42972b504dc14de7b4308a659c2da38b; 2022-12-21T21:19:55Z; eng; Elsevier; Environment International; 0160-4120; 2022-01-01; vol. 159; art. 107025
Evaluation of a semi-automated data extraction tool for public health literature-based reviews: Dextr
Vickie R. Walker: Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA. Corresponding author at: NIEHS, P.O. Box 12233, Mail Drop K2–04, Research Triangle Park, NC 27709, USA. Express mail: 530 Davis Drive, Morrisville, NC 27560, USA.
Charles P. Schmitt: Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
Mary S. Wolfe: Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
Artur J. Nowak: Evidence Prime Inc, Krakow, Poland
Kuba Kulesza: Evidence Prime Inc, Krakow, Poland
Ashley R. Williams: ICF, Research Triangle Park, NC, USA
Rob Shin: ICF, Research Triangle Park, NC, USA
Jonathan Cohen: ICF, Research Triangle Park, NC, USA
Dave Burch: ICF, Research Triangle Park, NC, USA
Matthew D. Stout: Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
Kelly A. Shipkowski: Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
Andrew A. Rooney: Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA
http://www.sciencedirect.com/science/article/pii/S0160412021006504
Automation; Text mining; Machine learning; Natural language processing; Literature review; Systematic review
title Evaluation of a semi-automated data extraction tool for public health literature-based reviews: Dextr
topic Automation
Text mining
Machine learning
Natural language processing
Literature review
Systematic review
url http://www.sciencedirect.com/science/article/pii/S0160412021006504