Data Augmentation and Pretraining for Template-Based Retrosynthetic Prediction in Computer-Aided Synthesis Planning

Bibliographic Details
Main Authors: Fortunato, Michael E, Coley, Connor W, Barnes, Brian C, Jensen, Klavs F
Other Authors: Massachusetts Institute of Technology. Department of Chemical Engineering
Format: Article
Language: English
Published: American Chemical Society (ACS) 2021
Online Access: https://hdl.handle.net/1721.1/134636
Description: Copyright © 2020 American Chemical Society. This work presents efforts to augment the performance of data-driven machine learning algorithms for reaction template recommendation used in computer-aided synthesis planning software. Often, machine learning models designed to prioritize reaction templates or molecular transformations focus on reporting high-accuracy metrics for the one-to-one mapping of product molecules in reaction databases to the template extracted from the recorded reaction. The templates selected for inclusion in these machine learning models have previously been limited to those that appear frequently in the reaction databases, which excludes potentially useful transformations. By augmenting open-access data sets of organic reactions with explicitly calculated template applicability and pretraining a template-relevance neural network on this augmented applicability data set, we report an increase in the template applicability recall and an increase in the diversity of predicted precursors. The augmentation and pretraining effectively teach the neural network an increased set of templates that could theoretically lead to successful reactions for a given target. Even on a small data set of well-curated reactions, the data augmentation and pretraining methods resulted in an increase in top-1 accuracy, especially for rare templates, indicating that these strategies can be very useful for small data sets.
DOI: 10.1021/ACS.JCIM.0C00403
Journal: Journal of Chemical Information and Modeling
License: Creative Commons Attribution-Noncommercial-Share Alike (http://creativecommons.org/licenses/by-nc-sa/4.0/)
File Format: application/pdf