VLA-SMILES: Variable-Length-Array SMILES Descriptors in Neural Network-Based QSAR Modeling

Machine learning represents a milestone in data-driven research, including material informatics, robotics, and computer-aided drug discovery. With the continuously growing virtual and synthetically available chemical space, efficient and robust quantitative structure–activity relationship (QSAR) met...

Bibliographic Details
Main Authors:	Antonina L. Nazarova, Aiichiro Nakano
Format:	Article
Language:	English
Published:	MDPI AG 2022-08-01
Series:	Machine Learning and Knowledge Extraction
Subjects:	machine learning, deep learning, neural networks, SMILES, descriptors, QSAR
Online Access:	https://www.mdpi.com/2504-4990/4/3/34

author	Antonina L. Nazarova, Aiichiro Nakano
collection	DOAJ
description	Machine learning represents a milestone in data-driven research, including material informatics, robotics, and computer-aided drug discovery. With the continuously growing virtual and synthetically available chemical space, efficient and robust quantitative structure–activity relationship (QSAR) methods are required to uncover molecules with desired properties. Herein, we propose variable-length-array SMILES-based (VLA-SMILES) structural descriptors that expand conventional SMILES descriptors widely used in machine learning. This structural representation extends the family of numerically coded SMILES, particularly binary SMILES, to expedite the discovery of new deep learning QSAR models with high predictive ability. VLA-SMILES descriptors were shown to speed up the training of QSAR models based on multilayer perceptron (MLP) with optimized backpropagation (ATransformedBP), resilient propagation (iRPROP⁻), and Adam optimization learning algorithms featuring rational train–test splitting, while improving the predictive ability toward the more compute-intensive binary SMILES representation format. All the tested MLPs under the same length-array-based SMILES descriptors showed similar predictive ability and convergence rate of training in combination with the considered learning procedures. Validation with the Kennard–Stone train–test splitting based on the structural descriptor similarity metrics was found more effective than the partitioning with the ranking by activity based on biological activity values metrics for the entire set of VLA-SMILES featured QSAR. Robustness and the predictive ability of MLP models based on VLA-SMILES were assessed via the method of QSAR parametric model validation. In addition, the method of the statistical H₀ hypothesis testing of the linear regression between real and observed activities based on the F(2, n−2) criteria was used for predictability estimation among VLA-SMILES featured QSAR-MLPs (with n being the volume of the testing set). Both approaches of QSAR parametric model validation and statistical hypothesis testing were found to correlate when used for the quantitative evaluation of predictabilities of the designed QSAR models with VLA-SMILES descriptors.
first_indexed	2024-03-09T23:20:08Z
format	Article
id	doaj.art-0e392cdde6ff472fb19fc33ba1685f73
institution	Directory of Open Access Journals
issn	2504-4990
language	English
last_indexed	2024-03-09T23:20:08Z
publishDate	2022-08-01
publisher	MDPI AG
record_format	Article
series	Machine Learning and Knowledge Extraction
affiliation	Antonina L. Nazarova: Department of Quantitative & Computational Biology, Bridge Institute, USC Michelson Center for Convergent Bioscience, University of Southern California, Los Angeles, CA 90089, USA
affiliation	Aiichiro Nakano: Collaboratory of Advanced Computing and Simulations, Department of Computer Science, Department of Physics & Astronomy, Department of Quantitative & Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
citation	Machine Learning and Knowledge Extraction, vol. 4, no. 3, pp. 715–737 (2022-08-01)
doi	10.3390/make4030034
title	VLA-SMILES: Variable-Length-Array SMILES Descriptors in Neural Network-Based QSAR Modeling
topic	machine learning, deep learning, neural networks, SMILES, descriptors, QSAR
url	https://www.mdpi.com/2504-4990/4/3/34
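
The abstract in this record credits Kennard–Stone train–test splitting on structural descriptor similarity. As a rough illustration of that general technique only, here is a minimal pure-Python sketch of the classic max-min selection rule. It is not the authors' implementation: the function name `kennard_stone_split`, the Euclidean distance, and the toy one-dimensional descriptors are all assumptions made for the example, and the record does not state which similarity metric the paper uses.

```python
import math


def kennard_stone_split(X, n_train):
    """Return (train_idx, test_idx) chosen by the Kennard-Stone max-min rule.

    X is a list of equal-length numeric descriptor vectors (hypothetical
    stand-ins for encoded SMILES). Euclidean distance is an assumption.
    """
    n = len(X)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and len(X)")

    # Full pairwise distance matrix (fine for small illustrative sets).
    dist = [[math.dist(a, b) for b in X] for a in X]

    # Seed the training set with the two mutually most distant samples.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    train = [i0, j0]
    remaining = [k for k in range(n) if k not in (i0, j0)]

    # min_d[k]: distance from candidate k to its nearest selected sample.
    min_d = {k: min(dist[k][i0], dist[k][j0]) for k in remaining}

    while len(train) < n_train:
        # Pick the candidate that is farthest from everything selected so far.
        nxt = max(remaining, key=lambda k: min_d[k])
        train.append(nxt)
        remaining.remove(nxt)
        del min_d[nxt]
        for k in remaining:
            min_d[k] = min(min_d[k], dist[k][nxt])

    return train, remaining


# Toy descriptors: five samples on a line.
toy = [(0.0,), (1.0,), (2.0,), (10.0,), (5.0,)]
train_idx, test_idx = kennard_stone_split(toy, 3)
print(train_idx, test_idx)  # [0, 3, 4] [1, 2]
```

The selection spreads the training set across descriptor space (the extremes first, then the point farthest from both), which is why this kind of split depends only on the structural descriptors and not on the activity values.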

Similar Items