From molecular diagrams to material properties: investigation of data-driven tools and strategies in the field of Chemoinformatics

The field of Chemoinformatics has enabled QSAR/QSPR predictive models useful for the rapid virtual assessment of compounds or mining knowledge that may contribute towards the understanding of chemical systems. However, scepticism remains about the practical value of these approaches: models often ex...

Bibliographic Details
Main Author:	Frade, APPO
Other Authors:	Cooper, RI
Format:	Thesis
Language:	English
Published:	2021
Subjects:	Computational chemistry; Cheminformatics; Machine learning; Statistics

collection	OXFORD
description	The field of Chemoinformatics has enabled QSAR/QSPR predictive models that are useful for the rapid virtual assessment of compounds and for mining knowledge that may contribute to the understanding of chemical systems. However, scepticism remains about the practical value of these approaches: models often exhibit limited predictive power, low generalisation capacity, and vague interpretability. More broadly, the field suffers from a general lack of consistency in results and procedures. This thesis addresses some of these challenges and explores new solutions for the field. First, standard protocols are used to explore the challenges of building chemical datasets suitable for statistical learning and to understand the limits on the predictive power of models. Here, the challenges of dataset curation and improvements to molecular structure description are explored. A library of new descriptors is built and assessed; it cleanly captures some information about crystallizability but does not lead to measurable performance improvements in the models studied. The lack of relevant information content in some chemical datasets is reviewed, and strategies are developed to increase the value of low-accuracy models: I report generalisability and confidence tools and methodologies designed to restrict model predictions to inputs within their applicability domain. Illustrative implementations led to performance increases of up to 11 percentage points in our predictive models. Finally, attempts are made to derive insights about structure-property relationships from chemoinformatics models. I develop a feature selection strategy that identifies the descriptors that influence a model’s predictions, and an interpretation method that identifies useful changes to a molecule to control predicted properties. Insights are identified such as the correlation between a higher crystallization propensity and the lower volume and flexibility of molecules. This work contributes towards the motivation and development of general-purpose open-source solutions, which are key to adding consistency to the field, increasing trust, and leveraging QSPR models to their full potential.
first_indexed	2024-03-07T07:04:56Z
format	Thesis
id	oxford-uuid:8dc08d08-c251-4dd4-b39a-a051b2ac7351
institution	University of Oxford
language	English
last_indexed	2024-03-07T07:04:56Z
publishDate	2021
record_format	dspace
spelling	oxford-uuid:8dc08d08-c251-4dd4-b39a-a051b2ac7351 2022-04-27T14:12:07Z From molecular diagrams to material properties: investigation of data-driven tools and strategies in the field of Chemoinformatics Thesis http://purl.org/coar/resource_type/c_db06 uuid:8dc08d08-c251-4dd4-b39a-a051b2ac7351 Computational chemistry Cheminformatics Machine learning Statistics English Hyrax Deposit 2021 Frade, APPO Cooper, RI McCabe, P
