A benchmark dataset for machine learning in ecotoxicology

Abstract The use of machine learning for predicting ecotoxicological outcomes is promising, but underutilized. The curation of data with informative features requires both expertise in machine learning as well as a strong biological and ecotoxicological background, which we consider a barrier of ent...

Bibliographic Details
Main Authors: Christoph Schür, Lilian Gasser, Fernando Perez-Cruz, Kristin Schirmer, Marco Baity-Jesi
Format: Article
Language: English
Published: Nature Portfolio 2023-10-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-023-02612-2
collection DOAJ
description Abstract The use of machine learning for predicting ecotoxicological outcomes is promising but underutilized. The curation of data with informative features requires both expertise in machine learning and a strong biological and ecotoxicological background, which we consider a barrier to entry for this kind of research. Additionally, model performances can only be compared across studies when the same dataset, cleaning, and splittings are used. Therefore, we provide ADORE, an extensive and well-described dataset on acute aquatic toxicity in three relevant taxonomic groups (fish, crustaceans, and algae). The core dataset describes ecotoxicological experiments and is expanded with phylogenetic and species-specific data on the species as well as chemical properties and molecular representations. Apart from challenging other researchers to achieve the best model performances across the whole dataset, we propose specific relevant challenges on subsets of the data and include datasets and splittings corresponding to each of these challenges, as well as an in-depth characterization and discussion of train-test splitting approaches.
first_indexed 2024-03-10T22:19:30Z
id doaj.art-94727c1144ae4964a6c9a9436b41f898
institution Directory Open Access Journal
issn 2052-4463
spelling doaj.art-94727c1144ae4964a6c9a9436b41f898 | 2023-11-19T12:20:16Z | eng | Nature Portfolio | Scientific Data | 2052-4463 | 2023-10-01 | 101120 | 10.1038/s41597-023-02612-2 | A benchmark dataset for machine learning in ecotoxicology | Christoph Schür (Eawag, Swiss Federal Institute of Aquatic Science and Technology); Lilian Gasser (Swiss Data Science Center (SDSC)); Fernando Perez-Cruz (Swiss Data Science Center (SDSC)); Kristin Schirmer (Eawag, Swiss Federal Institute of Aquatic Science and Technology); Marco Baity-Jesi (Eawag, Swiss Federal Institute of Aquatic Science and Technology) | https://doi.org/10.1038/s41597-023-02612-2