Predicting environmental stressor levels with machine learning: a comparison between amplicon sequencing, metagenomics, and total RNA sequencing based on taxonomically assigned data

Bibliographic Details
Main Authors: Christopher A. Hempel, Dominik Buchner, Leoni Mack, Marie V. Brasseur, Dan Tulpan, Florian Leese, Dirk Steinke
Format: Article
Language: English
Published: Frontiers Media S.A. 2023-11-01
Series: Frontiers in Microbiology
Subjects: metabarcoding; metatranscriptomics; freshwater; stressor prediction; bioinformatics; ExStream
Online Access: https://www.frontiersin.org/articles/10.3389/fmicb.2023.1217750/full
author Christopher A. Hempel
Dominik Buchner
Leoni Mack
Marie V. Brasseur
Dan Tulpan
Florian Leese
Dirk Steinke
author_sort Christopher A. Hempel
collection DOAJ
description Introduction: Microbes are increasingly (re)considered for environmental assessments because they are powerful indicators for the health of ecosystems. The complexity of microbial communities necessitates powerful novel tools to derive conclusions for environmental decision-makers, and machine learning is a promising option in that context. While amplicon sequencing is typically applied to assess microbial communities, metagenomics and total RNA sequencing (herein summarized as omics-based methods) can provide a more holistic picture of microbial biodiversity at sufficient sequencing depths. Despite this advantage, amplicon sequencing and omics-based methods have not yet been compared for taxonomy-based environmental assessments with machine learning.
Methods: In this study, we applied 16S and ITS-2 sequencing, metagenomics, and total RNA sequencing to samples from a stream mesocosm experiment that investigated the impacts of two aquatic stressors, insecticide and increased fine sediment deposition, on stream biodiversity. We processed the data using similarity clustering and denoising (only applicable to amplicon sequencing) as well as multiple taxonomic levels, data types, feature selection, and machine learning algorithms and evaluated the stressor prediction performance of each generated model for a total of 1,536 evaluated combinations of taxonomic datasets and data-processing methods.
Results: Sequencing and data-processing methods had a substantial impact on stressor prediction. While omics-based methods detected a higher diversity of taxa than amplicon sequencing, 16S sequencing outperformed all other sequencing methods in terms of stressor prediction based on the Matthews Correlation Coefficient. However, even the highest observed performance for 16S sequencing was still only moderate. Omics-based methods performed poorly overall, but this was likely due to insufficient sequencing depth. Data types had no impact on performance, while feature selection significantly improved performance for omics-based methods but not for amplicon sequencing.
Discussion: We conclude that amplicon sequencing might be a better candidate for machine-learning-based environmental stressor prediction than omics-based methods, but the latter require further research at higher sequencing depths to confirm this conclusion. More sampling could improve stressor prediction performance, and while this was not possible in the context of our study, thousands of sampling sites are monitored for routine environmental assessments, providing an ideal framework to further refine the approach for possible implementation in environmental diagnostics.
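Note on the evaluation metric: the abstract ranks models by the Matthews Correlation Coefficient (MCC). The following is a minimal sketch of how MCC is computed for a two-class case (stressor absent = 0, stressor present = 1). The binary labeling, the function name `matthews_corrcoef_binary`, and the sample labels are assumptions made for illustration; this record does not state how the study encoded stressor levels, and the data below are not from the study.

```python
import math


def matthews_corrcoef_binary(y_true, y_pred):
    """Return the MCC for two equal-length sequences of 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal total is zero.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(matthews_corrcoef_binary(y_true, y_pred))  # prints 0.5
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), and it uses all four confusion-matrix cells, which is why it is often preferred over accuracy for small or unbalanced sample sets.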
first_indexed 2024-03-09T18:23:12Z
format Article
id doaj.art-2c29af2200d54461afcd1aa297c11cd0
institution Directory Open Access Journal
issn 1664-302X
language English
last_indexed 2024-03-09T18:23:12Z
publishDate 2023-11-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Microbiology
spelling doaj.art-2c29af2200d54461afcd1aa297c11cd0 (record timestamp 2023-11-24T08:04:42Z)
Language: eng
Publisher: Frontiers Media S.A.
Journal: Frontiers in Microbiology, ISSN 1664-302X, volume 14, published 2023-11-01
DOI: 10.3389/fmicb.2023.1217750 (article number 1217750)
Author affiliations:
Christopher A. Hempel: Department of Integrative Biology, University of Guelph, Guelph, ON, Canada; Centre for Biodiversity Genomics, University of Guelph, Guelph, ON, Canada
Dominik Buchner: Aquatic Ecosystem Research, University of Duisburg-Essen, Essen, Germany
Leoni Mack: Faculty of Aquatic Ecology, University of Duisburg-Essen, Essen, Germany
Marie V. Brasseur: Leibniz Institute for the Analysis of Biodiversity Change, Zoological Research Museum A. Koenig, Bonn, Germany
Dan Tulpan: School of Computer Science, University of Guelph, Guelph, ON, Canada; Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada
Florian Leese: Aquatic Ecosystem Research, University of Duisburg-Essen, Essen, Germany; Centre for Water and Environmental Research (ZWU), University of Duisburg-Essen, Essen, Germany
Dirk Steinke: Department of Integrative Biology, University of Guelph, Guelph, ON, Canada; Centre for Biodiversity Genomics, University of Guelph, Guelph, ON, Canada
title Predicting environmental stressor levels with machine learning: a comparison between amplicon sequencing, metagenomics, and total RNA sequencing based on taxonomically assigned data
topic metabarcoding
metatranscriptomics
freshwater
stressor prediction
bioinformatics
ExStream