Novel text analytics approach to identify relevant literature for human health risk assessments: A pilot study with health effects of in utero exposures

Background: Systematic reviews involve mining literature databases to identify relevant studies. Identifying potentially relevant studies can be informed by computational tools comparing text similarity between candidate studies and selected key (i.e., seed) references. Challenge: Using computational a...


Bibliographic Details
Main Authors: Michelle Cawley, Renee Beardslee, Brandy Beverly, Andrew Hotchkiss, Ellen Kirrane, Reeder Sams, II, Arun Varghese, Jessica Wignall, John Cowden
Format: Article
Language: English
Published: Elsevier 2020-01-01
Series: Environment International
Online Access: http://www.sciencedirect.com/science/article/pii/S0160412019308463
author Michelle Cawley
Renee Beardslee
Brandy Beverly
Andrew Hotchkiss
Ellen Kirrane
Reeder Sams, II
Arun Varghese
Jessica Wignall
John Cowden
author_sort Michelle Cawley
collection DOAJ
description Background: Systematic reviews involve mining literature databases to identify relevant studies. Identifying potentially relevant studies can be informed by computational tools comparing text similarity between candidate studies and selected key (i.e., seed) references.
Challenge: Using computational approaches to identify relevant studies for risk assessments is challenging, as these assessments examine multiple chemical effects across lifestages (e.g., human health risk assessments) or specific effects of multiple chemicals (e.g., cumulative risk). The broad scope of potentially relevant literature can make selection of seed references difficult.
Approach: We developed a generalized computational scoping strategy to identify human health relevant studies for multiple chemicals and multiple effects. We used semi-supervised machine learning to prioritize studies to review manually, with training data derived from references cited in the hazard identification sections of several US EPA Integrated Risk Information System (IRIS) assessments. These generic training data, or seed studies, were clustered with the unclassified corpus to group studies based on text similarity. Clusters containing a high proportion of seed studies were prioritized for manual review. Chemical names were removed from seed studies prior to clustering, resulting in a generic, chemical-independent method for identifying potentially human health relevant studies. To test the recall of this novel literature searching strategy, we developed a case study that focused on identifying the array of chemicals that have been studied with respect to in utero exposure. We then evaluated the general strategy of using generic, chemical-independent training data with two previous IRIS assessments by comparing studies predicted relevant to those used in the assessments (i.e., total relevant).
Outcome: A keyword search designed to retrieve studies that examined the in utero effects of environmental chemicals identified over 54,000 candidate references. Clustering algorithms were applied using 1456 studies from multiple IRIS assessments, with chemical names removed, as training data or seeds (i.e., semi-supervised learning). Using a six-algorithm ensemble approach, 2602 articles, or approximately 5% of candidate references, were “voted” relevant by four or more clustering algorithms, and manual review confirmed nearly 50% of these studies were relevant. Further evaluations on two IRIS assessments, using a nine-algorithm ensemble approach and a set of generic, chemical-independent, externally derived seed studies, correctly identified 77–83% of hazard identification studies published in the assessments and eliminated the need to manually screen more than 75% of search results on average.
Limitations: The chemical-independent approach used to build the training literature set provides a broad and unbiased picture across a variety of endpoints and environmental exposures but does not systematically identify all available data. Variance between actual and predicted relevant studies will be greater because of the external and non-random origin of seed study selection. This approach depends on access to readily available generic training data that can be used to locate relevant references in an unclassified corpus.
Impact: A generic approach to identifying human health relevant studies could be an important first step in literature evaluation for risk assessments. This initial scoping approach could facilitate faster literature evaluation by focusing reviewer efforts, as well as potentially minimize reviewer bias in selection of key studies. Using externally derived training data has applicability particularly for databases with very low search precision, where identifying training data may be cost-prohibitive.
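The seed-based prioritization and ensemble voting described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the toy document IDs, and the 0.5 seed-fraction cutoff are assumptions (the abstract says only "a high proportion of seed studies"), and the cluster assignments are taken as already computed by whatever clustering algorithms are in use. The default of four votes mirrors the "four or more" of six algorithms reported in the abstract.

```python
from collections import Counter


def prioritized_clusters(assignments, seed_ids, min_seed_fraction):
    """Return the cluster labels whose share of seed studies meets the cutoff.

    assignments: dict mapping document id -> cluster label (one clustering run).
    seed_ids: set of document ids that are seed (training) studies.
    min_seed_fraction: hypothetical cutoff standing in for "a high proportion".
    """
    totals = Counter(assignments.values())
    seeds = Counter(
        label for doc, label in assignments.items() if doc in seed_ids
    )
    return {
        label
        for label, size in totals.items()
        if seeds[label] / size >= min_seed_fraction
    }


def ensemble_vote(runs, seed_ids, min_seed_fraction=0.5, min_votes=4):
    """Return candidate ids placed in a seed-rich cluster by >= min_votes runs."""
    votes = Counter()
    for assignments in runs:
        keep = prioritized_clusters(assignments, seed_ids, min_seed_fraction)
        for doc, label in assignments.items():
            # Only unclassified candidates are voted on, never the seeds.
            if doc not in seed_ids and label in keep:
                votes[doc] += 1
    return sorted(doc for doc, n in votes.items() if n >= min_votes)


# Toy data (hypothetical): two seeds, three candidates, six clustering runs.
seeds = {"s1", "s2"}
run_a = {"s1": 0, "s2": 0, "a": 0, "b": 1, "c": 1}  # "a" clusters with both seeds
run_b = {"s1": 0, "s2": 1, "a": 1, "b": 0, "c": 1}  # "b" clusters with one seed
runs = [run_a] * 4 + [run_b] * 2

print(ensemble_vote(runs, seeds))  # ['a']
```

Raising `min_votes` trades recall for a smaller manual screening set; lowering it does the opposite.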
first_indexed 2024-12-22T14:47:27Z
format Article
id doaj.art-6b7bb88f939848c298ad6460cd7832eb
institution Directory Open Access Journal
issn 0160-4120
language English
last_indexed 2024-12-22T14:47:27Z
publishDate 2020-01-01
publisher Elsevier
record_format Article
series Environment International
spelling doaj.art-6b7bb88f939848c298ad6460cd7832eb; 2022-12-21T18:22:25Z; eng; Elsevier; Environment International; 0160-4120; 2020-01-01; 134
Michelle Cawley (ICF, Durham, NC 27713, United States)
Renee Beardslee (National Center for Environmental Assessment, U.S. Environmental Protection Agency, Durham, NC 27711, United States)
Brandy Beverly (National Center for Environmental Assessment, U.S. Environmental Protection Agency, Durham, NC 27711, United States)
Andrew Hotchkiss (National Center for Environmental Assessment, U.S. Environmental Protection Agency, Durham, NC 27711, United States)
Ellen Kirrane (National Center for Environmental Assessment, U.S. Environmental Protection Agency, Durham, NC 27711, United States)
Reeder Sams, II (National Center for Environmental Assessment, U.S. Environmental Protection Agency, Durham, NC 27711, United States)
Arun Varghese (ICF, Durham, NC 27713, United States)
Jessica Wignall (ICF, Durham, NC 27713, United States)
John Cowden (National Center for Computational Toxicology, U.S. Environmental Protection Agency, Durham, NC 27711, United States; corresponding author)
title Novel text analytics approach to identify relevant literature for human health risk assessments: A pilot study with health effects of in utero exposures
url http://www.sciencedirect.com/science/article/pii/S0160412019308463