Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data

Abstract: Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Such models can be integrated into a conformal prediction (CP) framework, which adds a calibration step to estimate the confidence of the predictions. CP models offer the advantage of guaranteeing a predefined error rate, under the assumption that the test and calibration sets are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption may not hold and the models are no longer guaranteed to be valid. In this study, the performance of internally valid CP models was evaluated when they were applied either to newer time-split data or to external data. Specifically, temporal data drifts were analysed on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly available data and applied to proprietary data were investigated for the liver toxicity and in vivo micronucleus test (MNT) endpoints. In most cases, a drastic decrease in model validity was observed when the models were applied to the time-split or external (holdout) test sets. To overcome this decrease, a strategy of updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved validity, in many cases restoring it completely to its expected value. Restored validity is the first prerequisite for applying CP models with confidence. However, the increased validity comes at the cost of reduced model efficiency, as more predictions are identified as inconclusive. This study presents a strategy for recalibrating CP models to mitigate the effects of data drifts. Updating the calibration set, without retraining the model, proved to be a useful approach for restoring the validity of most models.
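The recalibration strategy summarised in the abstract can be made concrete with a short sketch. The following Python example is a minimal illustration, not the authors' implementation: it shows class-conditional (Mondrian) inductive conformal prediction for binary classification, together with the calibration-set update, in which the underlying model stays fixed and only the calibration scores are recomputed on newer data. The synthetic data, the random-forest model, the significance level of 0.2, and all function names are assumptions for illustration; NumPy and scikit-learn are assumed available.

    # Minimal sketch of Mondrian inductive conformal prediction with a
    # calibration-set update (illustrative; not the authors' code).
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for molecular descriptor data; labels are 0/1,
    # matching the column order of predict_proba below.
    X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.6, random_state=0)
    X_cal, X_rest, y_cal, y_rest = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
    # "Newer" data: one half updates the calibration set, the other is the holdout.
    X_new, X_holdout, y_new, y_holdout = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    def calibrate(model, X_cal, y_cal):
        # Class-conditional (Mondrian) nonconformity scores on the calibration
        # set: 1 minus the predicted probability of the true class.
        probs = model.predict_proba(X_cal)
        scores = 1.0 - probs[np.arange(len(y_cal)), y_cal]
        return {c: np.sort(scores[y_cal == c]) for c in np.unique(y_cal)}

    def predict_sets(model, cal_scores, X, significance=0.2):
        # Prediction set per sample: every class whose p-value exceeds the
        # significance level. Sets with both classes are "inconclusive".
        probs = model.predict_proba(X)
        sets = []
        for p in probs:
            region = set()
            for c, scores in cal_scores.items():
                alpha = 1.0 - p[c]  # nonconformity of the test sample for class c
                p_val = (np.sum(scores >= alpha) + 1) / (len(scores) + 1)
                if p_val > significance:
                    region.add(c)
            sets.append(region)
        return sets

    # The model is trained once; only the calibration scores change.
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    for label, scores in [("original calibration", calibrate(model, X_cal, y_cal)),
                          ("updated calibration", calibrate(model, X_new, y_new))]:
        sets = predict_sets(model, scores, X_holdout, significance=0.2)
        error = np.mean([y not in s for s, y in zip(sets, y_holdout)])
        conclusive = np.mean([len(s) == 1 for s in sets])
        print(f"{label}: error rate {error:.2f} (target <= 0.20), "
              f"single-label predictions {conclusive:.2f}")

Validity here means the observed error rate stays at or below the chosen significance level; efficiency is reported as the fraction of single-label (conclusive) prediction sets. On i.i.d. synthetic data such as this, both calibrations will be valid; under a genuine data drift, only the updated calibration would be expected to restore validity, typically at the cost of efficiency, as the abstract describes.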

Bibliographic Details
Main Authors: Andrea Morger, Marina Garcia de Lomana, Ulf Norinder, Fredrik Svensson, Johannes Kirchmair, Miriam Mathea, Andrea Volkamer
Format: Article
Language: English
Published: Nature Portfolio, 2022-05-01
Series: Scientific Reports
ISSN: 2045-2322
Online Access: https://doi.org/10.1038/s41598-022-09309-3
Author Affiliations:
Andrea Morger: In Silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité Universitätsmedizin Berlin
Marina Garcia de Lomana: BASF SE
Ulf Norinder: Department of Pharmaceutical Biosciences, Uppsala University
Fredrik Svensson: Alzheimer’s Research UK UCL Drug Discovery Institute
Johannes Kirchmair: Division of Pharmaceutical Chemistry, Department of Pharmaceutical Sciences, University of Vienna
Miriam Mathea: BASF SE
Andrea Volkamer: In Silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité Universitätsmedizin Berlin