PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications
Abstract Computing binding affinities is of great importance in drug discovery pipeline and its prediction using advanced machine learning methods still remains a major challenge as the existing datasets and models do not consider the dynamic features of protein-ligand interactions. To this end, we...
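Main Authors: | Divya B. Korlepara, Vasavi C. S., Rakesh Srivastava, Pradeep Kumar Pal, Saalim H. Raza, Vishal Kumar, Shivam Pandit, Aathira G. Nair, Sanjana Pandey, Shubham Sharma, Shruti Jeurkar, Kavita Thakran, Reena Jaglan, Shivangi Verma, Indhu Ramachandran, Prathit Chatterjee, Divya Nayar, U. Deva Priyakumar |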
---|---|
Format: | Article |
Language: | English |
Published: | Nature Portfolio 2024-02-01 |
Series: | Scientific Data |
Online Access: | https://doi.org/10.1038/s41597-023-02872-y |
_version_ | 1797275918482276352 |
---|---|
author | Divya B. Korlepara, Vasavi C. S., Rakesh Srivastava, Pradeep Kumar Pal, Saalim H. Raza, Vishal Kumar, Shivam Pandit, Aathira G. Nair, Sanjana Pandey, Shubham Sharma, Shruti Jeurkar, Kavita Thakran, Reena Jaglan, Shivangi Verma, Indhu Ramachandran, Prathit Chatterjee, Divya Nayar, U. Deva Priyakumar |
author_facet | Divya B. Korlepara, Vasavi C. S., Rakesh Srivastava, Pradeep Kumar Pal, Saalim H. Raza, Vishal Kumar, Shivam Pandit, Aathira G. Nair, Sanjana Pandey, Shubham Sharma, Shruti Jeurkar, Kavita Thakran, Reena Jaglan, Shivangi Verma, Indhu Ramachandran, Prathit Chatterjee, Divya Nayar, U. Deva Priyakumar |
author_sort | Divya B. Korlepara |
collection | DOAJ |
description | Abstract Computing binding affinities is of great importance in drug discovery pipeline and its prediction using advanced machine learning methods still remains a major challenge as the existing datasets and models do not consider the dynamic features of protein-ligand interactions. To this end, we have developed PLAS-20k dataset, an extension of previously developed PLAS-5k, with 97,500 independent simulations on a total of 19,500 different protein-ligand complexes. Our results show good correlation with the available experimental values, performing better than docking scores. This holds true even for a subset of ligands that follows Lipinski’s rule, and for diverse clusters of complex structures, thereby highlighting the importance of PLAS-20k dataset in developing new ML models. Along with this, our dataset is also beneficial in classifying strong and weak binders compared to docking. Further, OnionNet model has been retrained on PLAS-20k dataset and is provided as a baseline for the prediction of binding affinities. We believe that large-scale MD-based datasets along with trajectories will form new synergy, paving the way for accelerating drug discovery. |
first_indexed | 2024-03-07T15:20:55Z |
format | Article |
id | doaj.art-5b18ec8f9dee48fface94244fcebe6b1 |
institution | Directory Open Access Journal |
issn | 2052-4463 |
language | English |
last_indexed | 2024-03-07T15:20:55Z |
publishDate | 2024-02-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Data |
spelling | doaj.art-5b18ec8f9dee48fface94244fcebe6b1; 2024-03-05T17:39:42Z; eng; Nature Portfolio; Scientific Data; 2052-4463; 2024-02-01; 11119; 10.1038/s41597-023-02872-y; PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications; Divya B. Korlepara (0); Vasavi C. S. (1); Rakesh Srivastava (2); Pradeep Kumar Pal (3); Saalim H. Raza (4); Vishal Kumar (5); Shivam Pandit (6); Aathira G. Nair (7); Sanjana Pandey (8); Shubham Sharma (9); Shruti Jeurkar (10); Kavita Thakran (11); Reena Jaglan (12); Shivangi Verma (13); Indhu Ramachandran (14); Prathit Chatterjee (15); Divya Nayar (16); U. Deva Priyakumar (17); IHub-Data, International Institute of Information Technology; IHub-Data, International Institute of Information Technology; Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology; Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology; IHub-Data, International Institute of Information Technology; Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology; IHub-Data, International Institute of Information Technology; IHub-Data, International Institute of Information Technology; IHub-Data, International Institute of Information Technology; IHub-Data, International Institute of Information Technology; Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology; IHub-Data, International Institute of Information Technology; IHub-Data, International Institute of Information Technology; IHub-Data, International Institute of Information Technology; IHub-Data, International Institute of Information Technology; IHub-Data, International Institute of Information Technology; Department of Materials Science and Engineering, Indian Institute of Technology Delhi; IHub-Data, International Institute of Information Technology; Abstract Computing binding affinities is of great importance in drug discovery pipeline and its prediction using advanced machine learning methods still remains a major challenge as the existing datasets and models do not consider the dynamic features of protein-ligand interactions. To this end, we have developed PLAS-20k dataset, an extension of previously developed PLAS-5k, with 97,500 independent simulations on a total of 19,500 different protein-ligand complexes. Our results show good correlation with the available experimental values, performing better than docking scores. This holds true even for a subset of ligands that follows Lipinski’s rule, and for diverse clusters of complex structures, thereby highlighting the importance of PLAS-20k dataset in developing new ML models. Along with this, our dataset is also beneficial in classifying strong and weak binders compared to docking. Further, OnionNet model has been retrained on PLAS-20k dataset and is provided as a baseline for the prediction of binding affinities. We believe that large-scale MD-based datasets along with trajectories will form new synergy, paving the way for accelerating drug discovery.; https://doi.org/10.1038/s41597-023-02872-y |
spellingShingle | Divya B. Korlepara Vasavi C. S. Rakesh Srivastava Pradeep Kumar Pal Saalim H. Raza Vishal Kumar Shivam Pandit Aathira G. Nair Sanjana Pandey Shubham Sharma Shruti Jeurkar Kavita Thakran Reena Jaglan Shivangi Verma Indhu Ramachandran Prathit Chatterjee Divya Nayar U. Deva Priyakumar PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications Scientific Data |
title | PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications |
title_full | PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications |
title_fullStr | PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications |
title_full_unstemmed | PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications |
title_short | PLAS-20k: Extended Dataset of Protein-Ligand Affinities from MD Simulations for Machine Learning Applications |
title_sort | plas 20k extended dataset of protein ligand affinities from md simulations for machine learning applications |
url | https://doi.org/10.1038/s41597-023-02872-y |
work_keys_str_mv | AT divyabkorlepara plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT vasavics plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT rakeshsrivastava plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT pradeepkumarpal plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT saalimhraza plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT vishalkumar plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT shivampandit plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT aathiragnair plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT sanjanapandey plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT shubhamsharma plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT shrutijeurkar plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT kavitathakran plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT reenajaglan plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT shivangiverma plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT indhuramachandran plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT prathitchatterjee plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT divyanayar plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications AT udevapriyakumar plas20kextendeddatasetofproteinligandaffinitiesfrommdsimulationsformachinelearningapplications |