Resolving Transition Metal Chemical Space: Feature Selection for Machine Learning and Structure–Property Relationships

Machine learning (ML) of quantum mechanical properties shows promise for accelerating chemical discovery. For transition metal chemistry where accurate calculations are computationally costly and available training data sets are small, the molecular representation becomes a critical ingredient in ML...

Bibliographic Details
Main Authors: Janet, Jon Paul, Kulik, Heather J.
Other Authors: Massachusetts Institute of Technology. Department of Chemical Engineering
Format: Article
Published: American Chemical Society (ACS) 2020
Online Access: https://hdl.handle.net/1721.1/123835
collection MIT
description Machine learning (ML) of quantum mechanical properties shows promise for accelerating chemical discovery. For transition metal chemistry, where accurate calculations are computationally costly and available training data sets are small, the molecular representation becomes a critical ingredient in ML model predictive accuracy. We introduce a series of revised autocorrelation functions (RACs) that encode relationships between heuristic atomic properties (e.g., size, connectivity, and electronegativity) on a molecular graph. We alter the starting point, scope, and nature of the quantities evaluated in standard ACs to make these RACs amenable to inorganic chemistry. On an organic molecule set, we first demonstrate that standard ACs outperform other presently available topological descriptors for ML model training, with mean unsigned errors (MUEs) for atomization energies on set-aside test molecules as low as 6 kcal/mol. For inorganic chemistry, our RACs yield 1 kcal/mol ML MUEs on set-aside test molecules for spin-state splitting, compared with 15–20× higher errors for feature sets that encode whole-molecule structural information. Systematic feature selection methods, including univariate filtering, recursive feature elimination, and direct optimization (e.g., random forest and LASSO), are compared. Random-forest- or LASSO-selected subsets 4–5× smaller than the full RAC set produce sub- to 1 kcal/mol spin-splitting MUEs, with good transferability to metal–ligand bond length prediction (0.004–0.005 Å MUE) and redox potential on a smaller data set (0.2–0.3 eV MUE). Evaluation of feature selection results across property sets reveals the relative importance of local, electronic descriptors (e.g., electronegativity, atomic number) in spin-splitting and of distal, steric effects in redox potential and bond lengths.
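The autocorrelation idea in the description can be illustrated with a small sketch. A standard autocorrelation sums the product of an atomic property over all atom pairs separated by a given number of bonds on the molecular graph. The `starts` and `difference` options below are assumptions added only to illustrate what changing the "starting point" and the "nature of the quantities" could look like; they are not the authors' exact RAC definitions. The function names and the four-atom toy graph are hypothetical.

```python
from collections import deque


def graph_distances(adjacency, start):
    """Breadth-first search: bond-path distance from `start` to each reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def autocorrelation(adjacency, prop, depth, starts=None, difference=False):
    """Sum p_i * p_j (or p_i - p_j) over ordered pairs (i, j) exactly `depth` bonds apart.

    starts=None uses every atom as a start (standard autocorrelation).
    Passing a list of start atoms and difference=True are illustrative
    assumptions, not the published RAC definitions.
    """
    if starts is None:
        starts = range(len(adjacency))
    total = 0.0
    for i in starts:
        for j, d in graph_distances(adjacency, i).items():
            if d == depth:
                total += (prop[i] - prop[j]) if difference else prop[i] * prop[j]
    return total


# Hypothetical toy graph: atom 0 is bonded to atoms 1 and 2; atom 1 is bonded to atom 3.
adjacency = [[1, 2], [0, 3], [0], [1]]
# Atomic numbers used as the property (toy values, not a real complex).
atomic_number = [26, 7, 8, 6]

full_scope = [autocorrelation(adjacency, atomic_number, d) for d in range(4)]
center_product = autocorrelation(adjacency, atomic_number, 1, starts=[0])
center_difference = autocorrelation(adjacency, atomic_number, 1, starts=[0], difference=True)
print(full_scope, center_product, center_difference)
```

Each depth yields one feature per atomic property, so a handful of properties and depths gives a fixed-length vector regardless of molecule size, which is what makes such descriptors convenient inputs for the feature selection methods the description compares.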
first_indexed 2024-09-23T09:08:29Z
format Article
id mit-1721.1/123835
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T09:08:29Z
spelling mit-1721.1/123835 2022-09-26T10:44:20Z
Funding: United States. Office of Naval Research (Grant N00014-17-1-2956); National Science Foundation (Grant ECCS-1449291); National Science Foundation (Grant CBET-1704266)
Dates: 2020-02-20T18:25:58Z; 2020-02-20T18:25:58Z; 2017-11; 2017-10
Type: Article (http://purl.org/eprint/type/JournalArticle)
ISSN: 1089-5639; 1520-5215
Handle: https://hdl.handle.net/1721.1/123835
Citation: Janet, Jon Paul and Heather J. Kulik. "Resolving Transition Metal Chemical Space: Feature Selection for Machine Learning and Structure–Property Relationships." Journal of Physical Chemistry A 121, 46 (November 2017): 8939-8954 © 2017 American Chemical Society
DOI: http://dx.doi.org/10.1021/acs.jpca.7b08750
Journal: Journal of Physical Chemistry A
Rights: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.
File format: application/pdf
Publisher: American Chemical Society (ACS)
Prof. Kulik