PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications

Abstract Computing binding affinities is of great importance in the drug discovery pipeline, and predicting them with advanced machine learning methods remains a major challenge because existing datasets and models do not consider the dynamic features of protein-ligand interactions. To this end, we...

Bibliographic Details
Main Authors: Divya B. Korlepara, Vasavi C. S., Rakesh Srivastava, Pradeep Kumar Pal, Saalim H. Raza, Vishal Kumar, Shivam Pandit, Aathira G. Nair, Sanjana Pandey, Shubham Sharma, Shruti Jeurkar, Kavita Thakran, Reena Jaglan, Shivangi Verma, Indhu Ramachandran, Prathit Chatterjee, Divya Nayar, U. Deva Priyakumar
Format: Article
Language: English
Published: Nature Portfolio 2024-02-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-023-02872-y
collection DOAJ
description Abstract Computing binding affinities is of great importance in the drug discovery pipeline, and predicting them with advanced machine learning methods remains a major challenge because existing datasets and models do not consider the dynamic features of protein-ligand interactions. To this end, we have developed the PLAS-20k dataset, an extension of the previously developed PLAS-5k, with 97,500 independent simulations on a total of 19,500 different protein-ligand complexes. Our results show good correlation with the available experimental values and perform better than docking scores. This holds true even for the subset of ligands that follow Lipinski’s rule and for diverse clusters of complex structures, highlighting the importance of the PLAS-20k dataset in developing new ML models. The dataset is also more useful than docking for classifying strong and weak binders. Further, the OnionNet model has been retrained on the PLAS-20k dataset and is provided as a baseline for the prediction of binding affinities. We believe that large-scale MD-based datasets, along with trajectories, will form a new synergy, paving the way for accelerating drug discovery.
first_indexed 2024-03-07T15:20:55Z
id doaj.art-5b18ec8f9dee48fface94244fcebe6b1
institution Directory of Open Access Journals
issn 2052-4463
last_indexed 2024-03-07T15:20:55Z
spelling doaj.art-5b18ec8f9dee48fface94244fcebe6b1; 2024-03-05T17:39:42Z; eng; Nature Portfolio; Scientific Data; 2052-4463; 2024-02-01; 11119; 10.1038/s41597-023-02872-y
PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications
0 Divya B. Korlepara: IHub-Data, International Institute of Information Technology
1 Vasavi C. S.: IHub-Data, International Institute of Information Technology
2 Rakesh Srivastava: Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology
3 Pradeep Kumar Pal: Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology
4 Saalim H. Raza: IHub-Data, International Institute of Information Technology
5 Vishal Kumar: Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology
6 Shivam Pandit: IHub-Data, International Institute of Information Technology
7 Aathira G. Nair: IHub-Data, International Institute of Information Technology
8 Sanjana Pandey: IHub-Data, International Institute of Information Technology
9 Shubham Sharma: IHub-Data, International Institute of Information Technology
10 Shruti Jeurkar: Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology
11 Kavita Thakran: IHub-Data, International Institute of Information Technology
12 Reena Jaglan: IHub-Data, International Institute of Information Technology
13 Shivangi Verma: IHub-Data, International Institute of Information Technology
14 Indhu Ramachandran: IHub-Data, International Institute of Information Technology
15 Prathit Chatterjee: IHub-Data, International Institute of Information Technology
16 Divya Nayar: Department of Materials Science and Engineering, Indian Institute of Technology Delhi
17 U. Deva Priyakumar: IHub-Data, International Institute of Information Technology
title PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications
url https://doi.org/10.1038/s41597-023-02872-y