Data Augmentation and Pretraining for Template-Based Retrosynthetic Prediction in Computer-Aided Synthesis Planning

Copyright © 2020 American Chemical Society. This work presents efforts to augment the performance of data-driven machine learning algorithms for reaction template recommendation used in computer-aided synthesis planning software. Often, machine learning models designed to perform the task of priorit...

Bibliographic Details
Main Authors:	Fortunato, Michael E; Coley, Connor W; Barnes, Brian C; Jensen, Klavs F
Other Authors:	Massachusetts Institute of Technology. Department of Chemical Engineering
Format:	Article
Language:	English
Published:	American Chemical Society (ACS) 2021
Online Access:	https://hdl.handle.net/1721.1/134636

Description:	Copyright © 2020 American Chemical Society. This work presents efforts to augment the performance of data-driven machine learning algorithms for reaction template recommendation used in computer-aided synthesis planning software. Often, machine learning models designed to perform the task of prioritizing reaction templates or molecular transformations are focused on reporting high-accuracy metrics for the one-to-one mapping of product molecules in reaction databases to the template extracted from the recorded reaction. The available templates that get selected for inclusion in these machine learning models have been previously limited to those that appear frequently in the reaction databases and exclude potentially useful transformations. By augmenting open-access data sets of organic reactions with explicitly calculated template applicability and pretraining a template-relevance neural network on this augmented applicability data set, we report an increase in the template applicability recall and an increase in the diversity of predicted precursors. The augmentation and pretraining effectively teaches the neural network an increased set of templates that could theoretically lead to successful reactions for a given target. Even on a small data set of well-curated reactions, the data augmentation and pretraining methods resulted in an increase in top-1 accuracy, especially for rare templates, indicating that these strategies can be very useful for small data sets.
DOI:	10.1021/ACS.JCIM.0C00403
Journal:	Journal of Chemical Information and Modeling
License:	Creative Commons Attribution-Noncommercial-Share Alike (http://creativecommons.org/licenses/by-nc-sa/4.0/)
File Format:	application/pdf

Similar Items