A benchmark dataset for machine learning in ecotoxicology
Abstract The use of machine learning for predicting ecotoxicological outcomes is promising, but underutilized. The curation of data with informative features requires both expertise in machine learning as well as a strong biological and ecotoxicological background, which we consider a barrier of ent...
Main Authors: | Christoph Schür, Lilian Gasser, Fernando Perez-Cruz, Kristin Schirmer, Marco Baity-Jesi |
---|---|
Format: | Article |
Language: | English |
Published: | Nature Portfolio, 2023-10-01 |
Series: | Scientific Data |
Online Access: | https://doi.org/10.1038/s41597-023-02612-2 |
_version_ | 1797578262786867200 |
---|---|
author | Christoph Schür Lilian Gasser Fernando Perez-Cruz Kristin Schirmer Marco Baity-Jesi |
author_facet | Christoph Schür Lilian Gasser Fernando Perez-Cruz Kristin Schirmer Marco Baity-Jesi |
author_sort | Christoph Schür |
collection | DOAJ |
description | The use of machine learning for predicting ecotoxicological outcomes is promising, but underutilized. The curation of data with informative features requires expertise in machine learning as well as a strong biological and ecotoxicological background, which we consider a barrier to entry for this kind of research. Additionally, model performances can only be compared across studies when the same dataset, cleaning, and splittings were used. Therefore, we provide ADORE, an extensive and well-described dataset on acute aquatic toxicity in three relevant taxonomic groups (fish, crustaceans, and algae). The core dataset describes ecotoxicological experiments and is expanded with phylogenetic and species-specific data on the species as well as chemical properties and molecular representations. Apart from challenging other researchers to try and achieve the best model performances across the whole dataset, we propose specific relevant challenges on subsets of the data and include datasets and splittings corresponding to each of these challenges as well as in-depth characterization and discussion of train-test splitting approaches. |
first_indexed | 2024-03-10T22:19:30Z |
format | Article |
id | doaj.art-94727c1144ae4964a6c9a9436b41f898 |
institution | Directory Open Access Journal |
issn | 2052-4463 |
language | English |
last_indexed | 2024-03-10T22:19:30Z |
publishDate | 2023-10-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Data |
spelling | doaj.art-94727c1144ae4964a6c9a9436b41f898; 2023-11-19T12:20:16Z; eng; Nature Portfolio; Scientific Data; 2052-4463; 2023-10-01; 10; 1; 1; 20; 10.1038/s41597-023-02612-2; A benchmark dataset for machine learning in ecotoxicology; Christoph Schür [Eawag, Swiss Federal Institute of Aquatic Science and Technology]; Lilian Gasser [Swiss Data Science Center (SDSC)]; Fernando Perez-Cruz [Swiss Data Science Center (SDSC)]; Kristin Schirmer [Eawag, Swiss Federal Institute of Aquatic Science and Technology]; Marco Baity-Jesi [Eawag, Swiss Federal Institute of Aquatic Science and Technology]; https://doi.org/10.1038/s41597-023-02612-2 |
title | A benchmark dataset for machine learning in ecotoxicology |
title_sort | benchmark dataset for machine learning in ecotoxicology |
url | https://doi.org/10.1038/s41597-023-02612-2 |