Exploring the ability of machine learning-based virtual screening models to identify the functional groups responsible for binding


Bibliographic Details
Main Authors: Thomas E. Hadfield, Jack Scantlebury, Charlotte M. Deane
Format: Article
Language: English
Published: BMC 2023-09-01
Series: Journal of Cheminformatics
Online Access: https://doi.org/10.1186/s13321-023-00755-3
Collection: DOAJ
Abstract: Many recently proposed structure-based virtual screening models appear to be able to accurately distinguish high-affinity binders from non-binders. However, several recent studies have shown that they often do so by exploiting ligand-specific biases in the dataset, rather than identifying favourable intermolecular interactions in the input protein-ligand complex. In this work we propose a novel approach for assessing the extent to which machine learning-based virtual screening models are able to identify the functional groups responsible for binding. To sidestep the difficulty in establishing the ground truth importance of each atom of a large-scale set of protein-ligand complexes, we propose a protocol for generating synthetic data. Each ligand in the dataset is surrounded by a randomly sampled point cloud of pharmacophores, and the label assigned to the synthetic protein-ligand complex is determined by a 3-dimensional deterministic binding rule. This allows us to precisely quantify the ground truth importance of each atom and compare it to the model-generated attributions. Using our generated datasets, we demonstrate that a recently proposed deep learning-based virtual screening model, PointVS, identified the most important functional groups with 39% more efficiency than a fingerprint-based random forest, suggesting that it would generalise more effectively to new examples. In addition, we found that ligand-specific biases, such as those present in widely used virtual screening datasets, substantially impaired the ability of all ML models to identify the most important functional groups. We have made our synthetic data generation framework available to facilitate the benchmarking of new virtual screening models. Code is available at https://github.com/tomhadfield95/synthVS.
ID: doaj.art-35e3cbb9c86245468e7c48f5051c1d0f
Institution: Directory Open Access Journal
ISSN: 1758-2946
Affiliations: Oxford Protein Informatics Group, Department of Statistics, University of Oxford (all three authors)
Volume: 15, Issue: 1, Pages: 1-15
Subjects: Structure-based virtual screening; Machine learning; Interpretability
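
Illustrative sketch: The abstract describes surrounding each ligand with a randomly sampled point cloud of pharmacophores and labelling the resulting complex with a deterministic 3-dimensional binding rule, so that per-atom ground truth is known exactly. The Python below is a minimal sketch of that idea only, not the authors' synthVS implementation. The pharmacophore types, the 4.0 angstrom cutoff, the donor/acceptor rule, and the function names (`sample_pharmacophore_cloud`, `apply_binding_rule`, `top_k_recovery`) are all assumptions made for illustration, and `top_k_recovery` is a simple stand-in score, not the efficiency metric reported in the paper.

```python
import numpy as np

# Hypothetical pharmacophore type codes and distance cutoff (assumptions).
DONOR, ACCEPTOR, HYDROPHOBE = 0, 1, 2
CUTOFF = 4.0  # angstroms, illustrative value only


def sample_pharmacophore_cloud(ligand_xyz, n_points, box_pad, rng):
    """Sample typed pharmacophore points uniformly in a padded box around the ligand."""
    lo = ligand_xyz.min(axis=0) - box_pad
    hi = ligand_xyz.max(axis=0) + box_pad
    xyz = rng.uniform(lo, hi, size=(n_points, 3))
    types = rng.integers(0, 3, size=n_points)
    return xyz, types


def apply_binding_rule(ligand_xyz, ligand_types, cloud_xyz, cloud_types, cutoff=CUTOFF):
    """Deterministic rule: a complex is 'active' if any ligand donor/acceptor
    lies within `cutoff` of a complementary pharmacophore point.
    Returns (label, boolean mask of ligand atoms that satisfy the rule)."""
    # Pairwise distances, shape (n_atoms, n_points).
    dist = np.linalg.norm(ligand_xyz[:, None, :] - cloud_xyz[None, :, :], axis=-1)
    lig = ligand_types[:, None]
    cloud = cloud_types[None, :]
    complementary = ((lig == DONOR) & (cloud == ACCEPTOR)) | (
        (lig == ACCEPTOR) & (cloud == DONOR)
    )
    important = (complementary & (dist <= cutoff)).any(axis=1)
    return int(important.any()), important


def top_k_recovery(attributions, important):
    """Fraction of the k highest-attributed atoms that are truly important,
    where k is the number of ground-truth important atoms."""
    k = int(important.sum())
    if k == 0:
        return float("nan")
    top = np.argsort(-attributions)[:k]
    return float(important[top].mean())
```

Because the label is a pure function of the geometry, the mask returned by `apply_binding_rule` can be compared directly with any model's per-atom attribution scores, which is the kind of comparison the abstract describes.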