Prediction of aqueous intrinsic solubility of druglike molecules using Random Forest regression trained with Wiki-pS0 database

The accurate prediction of solubility of drugs is still problematic. It was thought for a long time that shortfalls had been due to the lack of high-quality solubility data from the chemical space of drugs. This study considers the quality of solubility data, particularly of ionizable drugs. A database...

Bibliographic Details
Main Author:	Alex Avdeef
Format:	Article
Language:	English
Published:	International Association of Physical Chemists (IAPC) 2020-03-01
Series:	ADMET and DMPK
Subjects:	aqueous intrinsic solubility; druglike; interlaboratory experimental error; pdisol-x; general solubility equation (gse); abraham solvation equation (absolv); multiple linear regression (mlr); random forest regression (rfr); quantitative structure-property
Online Access:	http://pub.iapchem.org/ojs/index.php/admet/article/view/766

_version_	1828806916446355456
author	Alex Avdeef
author_facet	Alex Avdeef
author_sort	Alex Avdeef
collection	DOAJ
description	The accurate prediction of solubility of drugs is still problematic. It was thought for a long time that shortfalls had been due to the lack of high-quality solubility data from the chemical space of drugs. This study considers the quality of solubility data, particularly of ionizable drugs. A database is described, comprising 6355 entries of intrinsic solubility for 3014 different molecules, drawing on 1325 citations. In an earlier publication, many factors affecting the quality of the measurement had been discussed, and suggestions were offered to improve ways of extracting more reliable information from legacy data. Many of the suggestions have been implemented in this study. By correcting solubility for ionization (i.e., deriving intrinsic solubility, S0) and by normalizing temperature (by transforming measurements performed in the range 10-50 °C to 25 °C), it can now be estimated that the average interlaboratory reproducibility is 0.17 log unit. Empirical methods to predict solubility at best have hovered around a root mean square error (RMSE) of 0.6 log unit. Three prediction methods are compared here: (a) Yalkowsky’s general solubility equation (GSE), (b) Abraham solvation equation (ABSOLV), and (c) Random Forest regression (RFR) statistical machine learning. The latter two methods were trained using the new database. The RFR method outperforms the other two models, as anticipated. However, the ability to predict the solubility of drugs to the level of the quality of data is still out of reach. The data quality is not the limiting factor in prediction. The statistical machine learning methodologies are probably up to the task. Possibly what’s missing are solubility data from a few sparsely covered regions of the chemical space of drugs (particularly of research compounds). Also, new descriptors which can better differentiate the factors affecting solubility between molecules could be critical for narrowing the gap between the accuracy of the prediction models and that of the experimental data.
first_indexed	2024-12-12T08:16:39Z
format	Article
id	doaj.art-dbdd23cc75b64f7a96796d966d648668
institution	Directory Open Access Journal
issn	1848-7718
language	English
last_indexed	2024-12-12T08:16:39Z
publishDate	2020-03-01
publisher	International Association of Physical Chemists (IAPC)
record_format	Article
series	ADMET and DMPK
spelling	doaj.art-dbdd23cc75b64f7a96796d966d648668 | 2022-12-22T00:31:33Z | eng | International Association of Physical Chemists (IAPC) | ADMET and DMPK | 1848-7718 | 2020-03-01 | 8 | 1 | 29 | 77 | 10.5599/admet.766 | 418 | Alex Avdeef (in-ADME Research)
spellingShingle	Alex Avdeef Prediction of aqueous intrinsic solubility of druglike molecules using Random Forest regression trained with Wiki-pS0 database ADMET and DMPK aqueous intrinsic solubility druglike interlaboratory experimental error pdisol-x general solubility equation (gse) abraham solvation equation (absolv) multiple linear regression (mlr) random forest regression (rfr) quantitative structure-property
title	Prediction of aqueous intrinsic solubility of druglike molecules using Random Forest regression trained with Wiki-pS0 database
title_full	Prediction of aqueous intrinsic solubility of druglike molecules using Random Forest regression trained with Wiki-pS0 database
title_fullStr	Prediction of aqueous intrinsic solubility of druglike molecules using Random Forest regression trained with Wiki-pS0 database
title_full_unstemmed	Prediction of aqueous intrinsic solubility of druglike molecules using Random Forest regression trained with Wiki-pS0 database
title_short	Prediction of aqueous intrinsic solubility of druglike molecules using Random Forest regression trained with Wiki-pS0 database
title_sort	prediction of aqueous intrinsic solubility of druglike molecules using random forest regression trained with wiki ps0 database
topic	aqueous intrinsic solubility; druglike; interlaboratory experimental error; pdisol-x; general solubility equation (gse); abraham solvation equation (absolv); multiple linear regression (mlr); random forest regression (rfr); quantitative structure-property
url	http://pub.iapchem.org/ojs/index.php/admet/article/view/766
work_keys_str_mv	AT alexavdeef predictionofaqueousintrinsicsolubilityofdruglikemoleculesusingrandomforestregressiontrainedwithwikips0database
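
Illustrative note on the quantities named in the description. The abstract mentions correcting solubility for ionization, the general solubility equation (GSE), and RMSE, but does not give formulas. The sketch below is therefore not taken from the article: it assumes the commonly cited form of the GSE (log S0 = 0.5 - 0.01 (mp - 25) - log P, with S0 in mol/L and mp in °C) and a simple Henderson-Hasselbalch correction for a monoprotic compound. All function names and input values are hypothetical, and the article's own treatment of ionization may well be more elaborate.

```python
import math


def gse_log_s0(log_p: float, melting_point_c: float) -> float:
    """Estimate log S0 (mol/L) with the commonly cited general solubility equation.

    Assumed form: log S0 = 0.5 - 0.01 * (mp - 25) - log P.
    Compounds melting below 25 deg C get no melting-point penalty.
    """
    mp_term = max(melting_point_c - 25.0, 0.0)
    return 0.5 - 0.01 * mp_term - log_p


def intrinsic_log_s0(log_s: float, ph: float, pka: float, is_acid: bool) -> float:
    """Convert a log solubility measured at a given pH to intrinsic log S0.

    Assumes a monoprotic compound obeying the Henderson-Hasselbalch
    relation (no aggregation, no salt precipitation).
    """
    exponent = (ph - pka) if is_acid else (pka - ph)
    return log_s - math.log10(1.0 + 10.0 ** exponent)


def rmse(predicted: list[float], observed: list[float]) -> float:
    """Root mean square error between predicted and observed log S0 values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical compound: log P = 2.0, melting point 125 deg C.
    print(gse_log_s0(2.0, 125.0))
    # Hypothetical weak acid (pKa 4.5) measured at pH 6.5 with log S = -2.0.
    print(intrinsic_log_s0(-2.0, 6.5, 4.5, is_acid=True))
```

Under these assumptions, an RMSE computed with `rmse` over a held-out set is the kind of figure the abstract compares against the estimated 0.17 log unit interlaboratory reproducibility.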

Similar Items