Comparison of Resampling Algorithms to Address Class Imbalance when Developing Machine Learning Models to Predict Foodborne Pathogen Presence in Agricultural Water

Recent studies have shown that predictive models can supplement or provide alternatives to E. coli-testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, r...

Bibliographic Details
Main Authors:	Daniel Lowell Weller, Tanzy M. T. Love, Martin Wiedmann
Format:	Article
Language:	English
Published:	Frontiers Media S.A. 2021-06-01
Series:	Frontiers in Environmental Science
Subjects:	Listeria; Listeria (L.) monocytogenes; machine learning; predictive modeling; agricultural water; food safety
Online Access:	https://www.frontiersin.org/articles/10.3389/fenvs.2021.701288/full

_version_	1819289959633256448
author	Daniel Lowell Weller; Tanzy M. T. Love; Martin Wiedmann
author_facet	Daniel Lowell Weller; Tanzy M. T. Love; Martin Wiedmann
author_sort	Daniel Lowell Weller
collection	DOAJ
description	Recent studies have shown that predictive models can supplement or provide alternatives to E. coli testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners * 3 resampling approaches) and 108 nested (5 learners * 9 feature sets * 3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data is imbalanced.
first_indexed	2024-12-24T03:15:08Z
format	Article
id	doaj.art-75775a780b9b4626990907545151bd9a
institution	Directory of Open Access Journals
issn	2296-665X
language	English
last_indexed	2024-12-24T03:15:08Z
publishDate	2021-06-01
publisher	Frontiers Media S.A.
record_format	Article
series	Frontiers in Environmental Science
spelling	doaj.art-75775a780b9b4626990907545151bd9a; 2022-12-21T17:17:40Z; eng; Frontiers Media S.A.; Frontiers in Environmental Science; ISSN 2296-665X; 2021-06-01; volume 9; DOI 10.3389/fenvs.2021.701288; article 701288
	Daniel Lowell Weller: (1) Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY, United States; (2) Department of Environmental and Forest Biology, State University of New York, Environmental Science and Forestry, Syracuse, NY, United States; (3) Department of Food Science, Cornell University, Ithaca, NY, United States
	Tanzy M. T. Love: Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY, United States
	Martin Wiedmann: Department of Food Science, Cornell University, Ithaca, NY, United States
title	Comparison of Resampling Algorithms to Address Class Imbalance when Developing Machine Learning Models to Predict Foodborne Pathogen Presence in Agricultural Water
topic	Listeria; Listeria (L.) monocytogenes; machine learning; predictive modeling; agricultural water; food safety
url	https://www.frontiersin.org/articles/10.3389/fenvs.2021.701288/full
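
Note on the resampling methods named in the description: the record lists three options (none, oversampling, SMOTE) without describing them. The following is a minimal sketch of how the latter two differ, written in plain Python. It is not the authors' implementation; the function names, the two toy features, and the choice of k = 3 neighbours are all assumptions made for illustration. Random oversampling repeats existing minority-class rows, while SMOTE creates new rows by interpolating between a minority row and one of its nearest minority-class neighbours.

```python
import math
import random


def random_oversample(minority, n_needed, rng):
    """Return n_needed copies of minority rows, drawn with replacement."""
    return [list(rng.choice(minority)) for _ in range(n_needed)]


def smote(minority, n_needed, k, rng):
    """Return n_needed synthetic rows, each lying on the line segment
    between a minority row and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (Euclidean), excluding itself
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy, made-up data: 20 negative samples and 4 positive samples, each
# with two hypothetical numeric features.
rng = random.Random(0)
negatives = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(20)]
positives = [[rng.uniform(2, 3), rng.uniform(2, 3)] for _ in range(4)]

# Number of extra positive rows needed to balance the two classes.
n_needed = len(negatives) - len(positives)

dup = random_oversample(positives, n_needed, rng)  # exact repeats
syn = smote(positives, n_needed, k=3, rng=rng)     # interpolated rows

print(len(positives) + len(dup), len(positives) + len(syn), len(negatives))
```

In practice, resampling of either kind would be applied to the training split only, so that duplicated or synthetic rows do not leak into the data used to evaluate the model.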

Similar Items