Summary: | The field of chemoinformatics has enabled QSAR/QSPR predictive models that are useful for the rapid virtual assessment of compounds or for mining knowledge that may contribute to the understanding of chemical systems. However, scepticism remains about the practical value of these approaches: models often exhibit limited predictive power, low generalisation capacity, and vague interpretability. More broadly, the field suffers from a general lack of consistency in results and procedures. This thesis addresses some of these challenges and explores new solutions for the field. First, standard protocols are used to explore the challenges of building chemical datasets suitable for statistical learning and to understand the limitations of the predictive power of models. Here, the challenges of dataset curation and improvements to molecular structure description are explored. A library of new descriptors is built and assessed; it cleanly captures some information about crystallizability but does not lead to measurable performance improvements in the models studied. The lack of relevant information content in some chemical datasets is reviewed, and strategies are developed to increase the value of low-accuracy models: I report generalisability and confidence tools and methodologies designed to restrict model predictions to inputs within their applicability domain. Illustrative implementations led to performance increases of up to 11 percentage points for our predictive models. Finally, attempts are made to derive insights about structure-property relationships from chemoinformatics models. I develop a feature selection strategy that identifies the descriptors that influence a model’s predictions, and an interpretation method that identifies useful changes to a molecule to control predicted properties. Insights are identified, such as the correlation between higher crystallization propensity and lower molecular volume and flexibility.
This work contributes towards the motivation and development of general-purpose open-source solutions, which are key to adding consistency to the field, increasing trust, and leveraging QSPR models to their full potential.
|