Evaluation of a semi-automated data extraction tool for public health literature-based reviews: Dextr
Introduction: There has been limited development and uptake of machine-learning methods to automate data extraction for literature-based assessments. Although advanced extraction approaches have been applied to some clinical research reviews, existing methods are not well suited for addressing toxic...
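Main Authors: | Vickie R. Walker, Charles P. Schmitt, Mary S. Wolfe, Artur J. Nowak, Kuba Kulesza, Ashley R. Williams, Rob Shin, Jonathan Cohen, Dave Burch, Matthew D. Stout, Kelly A. Shipkowski, Andrew A. Rooney |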
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2022-01-01 |
Series: | Environment International |
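Subjects: | Automation; Text mining; Machine learning; Natural language processing; Literature review; Systematic review |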
Online Access: | http://www.sciencedirect.com/science/article/pii/S0160412021006504 |
_version_ | 1818753610014523392 |
---|---|
author | Vickie R. Walker; Charles P. Schmitt; Mary S. Wolfe; Artur J. Nowak; Kuba Kulesza; Ashley R. Williams; Rob Shin; Jonathan Cohen; Dave Burch; Matthew D. Stout; Kelly A. Shipkowski; Andrew A. Rooney |
author_sort | Vickie R. Walker |
collection | DOAJ |
description | Introduction: There has been limited development and uptake of machine-learning methods to automate data extraction for literature-based assessments. Although advanced extraction approaches have been applied to some clinical research reviews, existing methods are not well suited for addressing toxicology or environmental health questions due to unique data needs to support reviews in these fields. Objectives: To develop and evaluate a flexible, web-based tool for semi-automated data extraction that: 1) makes data extraction predictions with user verification, 2) integrates token-level annotations, and 3) connects extracted entities to support hierarchical data extraction. Methods: Dextr was developed with Agile software methodology using a two-team approach. The development team outlined proposed features and coded the software. The advisory team guided developers and evaluated Dextr’s performance on precision, recall, and extraction time by comparing a manual extraction workflow to a semi-automated extraction workflow using a dataset of 51 environmental health animal studies. Results: The semi-automated workflow did not appear to affect precision rate (96.0% vs. 95.4% manual, p = 0.38), resulted in a small reduction in recall rate (91.8% vs. 97.0% manual, p < 0.01), and substantially reduced the median extraction time (436 s vs. 933 s per study manual, p < 0.01) compared to a manual workflow. Discussion: Dextr provides similar performance to manual extraction in terms of recall and precision and greatly reduces data extraction time. Unlike other tools, Dextr provides the ability to extract complex concepts (e.g., multiple experiments with various exposures and doses within a single study), properly connect the extracted elements within a study, and effectively limit the work required by researchers to generate machine-readable, annotated exports. The Dextr tool addresses data-extraction challenges associated with environmental health sciences literature with a simple user interface, incorporates the key capabilities of user verification and entity connecting, provides a platform for further automation developments, and has the potential to improve data extraction for literature reviews in this and other fields. |
first_indexed | 2024-12-18T05:10:05Z |
format | Article |
id | doaj.art-42972b504dc14de7b4308a659c2da38b |
institution | Directory of Open Access Journals |
issn | 0160-4120 |
language | English |
last_indexed | 2024-12-18T05:10:05Z |
publishDate | 2022-01-01 |
publisher | Elsevier |
record_format | Article |
series | Environment International |
spelling | doaj.art-42972b504dc14de7b4308a659c2da38b; 2022-12-21T21:19:55Z; eng; Elsevier; Environment International; 0160-4120; 2022-01-01; 159; 107025; Evaluation of a semi-automated data extraction tool for public health literature-based reviews: Dextr; Vickie R. Walker (0), Charles P. Schmitt (1), Mary S. Wolfe (2), Artur J. Nowak (3), Kuba Kulesza (4), Ashley R. Williams (5), Rob Shin (6), Jonathan Cohen (7), Dave Burch (8), Matthew D. Stout (9), Kelly A. Shipkowski (10), Andrew A. Rooney (11); (0) Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA; Corresponding author at: NIEHS, P.O. Box 12233, Mail Drop K2–04, Research Triangle Park, NC 27709, USA. Express mail: 530 Davis Drive, Morrisville, NC 27560, USA. (1, 2, 9, 10, 11) Division of the National Toxicology Program (DNTP), National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH), Research Triangle Park, NC, USA. (3, 4) Evidence Prime Inc, Krakow, Poland. (5, 6, 7, 8) ICF, Research Triangle Park, NC, USA; http://www.sciencedirect.com/science/article/pii/S0160412021006504; Automation; Text mining; Machine learning; Natural language processing; Literature review; Systematic review |
title | Evaluation of a semi-automated data extraction tool for public health literature-based reviews: Dextr |
topic | Automation; Text mining; Machine learning; Natural language processing; Literature review; Systematic review |
url | http://www.sciencedirect.com/science/article/pii/S0160412021006504 |