Small molecule autoencoders: architecture engineering to optimize latent space utility and sustainability

Bibliographic Details
Main Authors:	Marie Oestreich, Iva Ewert, Matthias Becker
Format:	Article
Language:	English
Published:	BMC 2024-03-01
Series:	Journal of Cheminformatics
Subjects:	Molecular autoencoders; Latent space optimization; Sustainability; Resource optimization
Online Access:	https://doi.org/10.1186/s13321-024-00817-0

Description:	Autoencoders are frequently used to embed molecules for training of downstream deep learning models. However, evaluation of the chemical information quality in the latent spaces is lacking and the model architectures are often arbitrarily chosen. Unoptimized architectures may not only negatively affect latent space quality but also increase energy consumption during training, making the models unsustainable. We conducted systematic experiments to better understand how the autoencoder architecture affects the reconstruction and latent space quality and how it can be optimized towards the encoding task as well as energy consumption. We can show that optimizing the architecture allows us to maintain the quality of a generic architecture but using 97% less data and reducing energy consumption by around 36%. We additionally observed that representing the molecules as SELFIES reduced the reconstruction performance compared to SMILES and that training with enumerated SMILES drastically improved latent space quality. Scientific Contribution: This work provides the first comprehensive systematic analysis of how choosing the autoencoder architecture affects the reconstruction performance of small molecules, the chemical information content of the latent space as well as the energy required for training. Demonstrated on the MOSES benchmarking dataset it provides first valuable insights into how autoencoders for the embedding of small molecules can be designed to optimize their utility and simultaneously become more sustainable, both in terms of energy consumption as well as the required amount of training data. All code, data and model checkpoints are made available on Zenodo (Oestreich et al. Small molecule autoencoders: architecture engineering to optimize latent space utility and sustainability. Zenodo, 2024). Furthermore, the top models can be found on GitHub with scripts to encode custom molecules (https://github.com/MarieOestreich/small-molecule-autoencoders).
ISSN:	1758-2946
Volume:	16
Issue:	1
Pages:	1-14
Author Affiliation:	Modular High-Performance Computing and Artificial Intelligence, German Center for Neurodegenerative Diseases (DZNE) (all three authors)
Source:	Directory of Open Access Journals (DOAJ)
Record ID:	doaj.art-9564c81fcebf465298f539ee1d29ff16
First Indexed:	2024-03-07T14:43:51Z
Last Indexed:	2024-03-07T14:43:51Z

Similar Items