From molecular diagrams to material properties: investigation of data-driven tools and strategies in the field of Chemoinformatics

Bibliographic Details
Main Author: Frade, APPO
Other Authors: Cooper, RI
Format: Thesis
Language: English
Published: 2021
Subjects: Computational chemistry; Cheminformatics; Machine learning; Statistics
collection OXFORD
description The field of Chemoinformatics has enabled QSAR/QSPR predictive models that are useful for the rapid virtual assessment of compounds and for mining knowledge that may contribute to the understanding of chemical systems. However, scepticism remains about the practical value of these approaches: models often exhibit limited predictive power, low generalisation capacity, and vague interpretability. More broadly, the field suffers from a general lack of consistency in results and procedures. This thesis addresses some of these challenges and explores new solutions for the field. First, standard protocols are used to explore the challenges of building chemical datasets suitable for statistical learning and to understand the limitations of the predictive power of models. Here, the challenges of dataset curation and improvements to molecular structure description are explored. A library of new descriptors is built and assessed; it cleanly captures some information about crystallizability but did not lead to measurable performance improvements in the models studied. The lack of relevant information content in some chemical datasets is reviewed, and strategies are developed to increase the value of low-accuracy models: I report generalisability and confidence tools and methodologies designed to restrict model predictions to inputs within their applicability domain. Illustrative implementations led to performance increases of up to 11 percentage points in our predictive models. Finally, attempts are made to derive insights about structure-property relationships from chemoinformatics models. I develop a feature selection strategy that identifies the descriptors that influence a model’s predictions, and an interpretation method that identifies useful changes to a molecule to control predicted properties. Insights are identified, such as the correlation between a higher crystallization propensity and the lower volume and flexibility of molecules.
This work contributes towards the motivation and development of general-purpose open-source solutions, which are key to adding consistency to the field, increasing trust, and leveraging QSPR models to their full potential.
first_indexed 2024-03-07T07:04:56Z
format Thesis
id oxford-uuid:8dc08d08-c251-4dd4-b39a-a051b2ac7351
institution University of Oxford
language English
last_indexed 2024-03-07T07:04:56Z
publishDate 2021
record_format dspace
spelling oxford-uuid:8dc08d08-c251-4dd4-b39a-a051b2ac7351; 2022-04-27T14:12:07Z; From molecular diagrams to material properties: investigation of data-driven tools and strategies in the field of Chemoinformatics; Thesis; http://purl.org/coar/resource_type/c_db06; uuid:8dc08d08-c251-4dd4-b39a-a051b2ac7351; Computational chemistry; Cheminformatics; Machine learning; Statistics; English; Hyrax Deposit; 2021; Frade, APPO; Cooper, RI; McCabe, P
title From molecular diagrams to material properties: investigation of data-driven tools and strategies in the field of Chemoinformatics
topic Computational chemistry
Cheminformatics
Machine learning
Statistics