Efficient iterative virtual screening with Apache Spark and conformal prediction

Abstract. Background: Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or on large workstations in a brute-force manner,...


Bibliographic Details
Main Authors: Laeeq Ahmed, Valentin Georgiev, Marco Capuccini, Salman Toor, Wesley Schaal, Erwin Laure, Ola Spjuth
Format: Article
Language: English
Published: BMC 2018-03-01
Series: Journal of Cheminformatics
Subjects: Virtual screening; Docking; Conformal prediction; Cloud computing; Apache Spark
Online Access: http://link.springer.com/article/10.1186/s13321-018-0265-z
collection DOAJ
description Abstract
Background: Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or on large workstations in a brute-force manner, by docking and scoring all available ligands.
Contribution: In this study we propose a strategy based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and predicting the remainder of the ligands to exclude those predicted as ‘low-scoring’. Then another set of ligands is docked, the model is retrained, and the process is repeated until a certain model efficiency level is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use SVM and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling.
Results: We show on four different targets that conformal-prediction-based virtual screening (CPVS) is able to reduce the number of docked molecules by 62.61% while retaining an accuracy of 94% on average for the top 30 hits and achieving a speedup of 3.7. The implementation is available as open source via GitHub (https://github.com/laeeq80/spark-cpvs) and can be run on high-performance computers as well as on cloud resources.
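The iterative strategy described in the abstract can be illustrated with a small single-machine sketch in Python. This is not the authors' implementation, which uses SVM and Apache Spark: here a nearest-centroid nonconformity measure replaces the SVM, `dock` is a hypothetical stand-in for a real docking program, and every name and parameter (`cpvs`, `batch`, `epsilon`, `min_efficiency`, `top_fraction`) is an assumption made for illustration only.

```python
import random

def dock(ligand):
    # Hypothetical stand-in for an expensive docking run; higher score = better.
    return sum(ligand)

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nonconformity(x, label, cents):
    # Strangeness of x under `label`: close to the other class centroid = strange.
    return dist(x, cents[label]) - dist(x, cents[1 - label])

def p_values(x, cents, calib):
    # Class-conditional (Mondrian) inductive conformal p-value for each label.
    p = {}
    for label in (0, 1):
        a = nonconformity(x, label, cents)
        p[label] = (sum(c >= a for c in calib[label]) + 1) / (len(calib[label]) + 1)
    return p

def cpvs(library, batch=100, epsilon=0.2, min_efficiency=0.8,
         top_fraction=0.3, seed=0):
    rng = random.Random(seed)
    pool = list(library)
    rng.shuffle(pool)
    docked = {}  # ligand -> docking score
    while pool:
        # 1. Dock the next batch and add it to the labelled data.
        for lig in pool[:batch]:
            docked[lig] = dock(lig)
        pool = pool[batch:]
        if not pool:
            break
        # 2. Label docked ligands: 1 = high-scoring (top fraction), 0 = low-scoring.
        ranked = sorted(docked, key=docked.get, reverse=True)
        n_high = max(2, int(len(ranked) * top_fraction))
        by_label = {1: ranked[:n_high], 0: ranked[n_high:]}
        # 3. Per class: even positions train the model, odd positions calibrate it.
        cents = {lab: centroid(rows[0::2]) for lab, rows in by_label.items()}
        calib = {lab: [nonconformity(x, lab, cents) for x in rows[1::2]]
                 for lab, rows in by_label.items()}
        # 4. Predict a label set for every ligand not yet docked.
        regions = {}
        for lig in pool:
            p = p_values(lig, cents, calib)
            regions[lig] = {lab for lab in (0, 1) if p[lab] > epsilon}
        efficiency = sum(len(r) == 1 for r in regions.values()) / len(pool)
        if efficiency >= min_efficiency:
            # 5. Model is efficient enough: dock only predicted high-scoring ligands.
            for lig in pool:
                if regions[lig] == {1}:
                    docked[lig] = dock(lig)
            break
        # Otherwise drop ligands predicted 'low-scoring' and iterate.
        pool = [lig for lig in pool if regions[lig] != {0}]
    return docked
```

Calling `cpvs(library)` on a list of feature tuples returns only the ligands that were actually docked, so `len(docked) / len(library)` gives the fraction of docking runs performed in this toy setting.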
first_indexed 2024-12-22T04:27:00Z
id doaj.art-647b2172fc6b440c92b120f8f2e9bb73
institution Directory Open Access Journal
issn 1758-2946
last_indexed 2024-12-22T04:27:00Z
affiliation Laeeq Ahmed: Department of Computational Science and Technology, Royal Institute of Technology (KTH)
Valentin Georgiev: Department of Pharmaceutical Biosciences, Uppsala University
Marco Capuccini: Department of Pharmaceutical Biosciences, Uppsala University
Salman Toor: Department of Information Technology, Uppsala University
Wesley Schaal: Department of Pharmaceutical Biosciences, Uppsala University
Erwin Laure: Department of Computational Science and Technology, Royal Institute of Technology (KTH)
Ola Spjuth: Department of Pharmaceutical Biosciences, Uppsala University
topic Virtual screening
Docking
Conformal prediction
Cloud computing
Apache Spark