SELFormer: molecular representation learning via SELFIES language models

Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and material science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions o...

Bibliographic Details
Main Authors:	Atakan Yüksel, Erva Ulusoy, Atabey Ünlü, Tunca Doğan
Format:	Article
Language:	English
Published:	IOP Publishing 2023-01-01
Series:	Machine Learning: Science and Technology
Subjects:	molecular representation learning drug discovery molecular property prediction natural language processing transformers
Online Access:	https://doi.org/10.1088/2632-2153/acdb30

_version_	1797792293816631296
author	Atakan Yüksel Erva Ulusoy Atabey Ünlü Tunca Doğan
author_facet	Atakan Yüksel Erva Ulusoy Atabey Ünlü Tunca Doğan
author_sort	Atakan Yüksel
collection	DOAJ
description	Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and material science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data, for efficient usage in subsequent prediction tasks. One approach to efficiently learn molecular representations is processing string-based notations of chemicals via natural language processing algorithms. The majority of the methods proposed so far utilize the SMILES notation for this purpose, which is the most extensively used string-based encoding for molecules. However, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model (CLM) that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation has revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based CLMs, on predicting aqueous solubility of molecules and adverse drug reactions, while producing comparable results for the remaining tasks. We also visualized molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We shared SELFormer as a programmatic tool, together with its datasets and pre-trained models, at https://github.com/HUBioDataLab/SELFormer. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.
first_indexed	2024-03-13T02:32:10Z
format	Article
id	doaj.art-637ede00a8164e5e97ed8eb34b854e8c
institution	Directory Open Access Journal
issn	2632-2153
language	English
last_indexed	2024-03-13T02:32:10Z
publishDate	2023-01-01
publisher	IOP Publishing
record_format	Article
series	Machine Learning: Science and Technology
spelling	doaj.art-637ede00a8164e5e97ed8eb34b854e8c; 2023-06-29T13:36:01Z; eng; IOP Publishing; Machine Learning: Science and Technology; 2632-2153; 2023-01-01; volume 4, issue 2, article 025035; 10.1088/2632-2153/acdb30; SELFormer: molecular representation learning via SELFIES language models
	Authors: Atakan Yüksel (0); Erva Ulusoy (1); Atabey Ünlü (2); Tunca Doğan (3), https://orcid.org/0000-0002-1298-9763
	Affiliation 0: Biological Data Science Lab, Department of Computer Engineering, Hacettepe University, Ankara, Turkey
	Affiliation 1: Biological Data Science Lab, Department of Computer Engineering, Hacettepe University, Ankara, Turkey; Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey
	Affiliation 2: Biological Data Science Lab, Department of Computer Engineering, Hacettepe University, Ankara, Turkey; Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey
	Affiliation 3: Biological Data Science Lab, Department of Computer Engineering, Hacettepe University, Ankara, Turkey; Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey; Institute of Informatics, Hacettepe University, Ankara, Turkey
	https://doi.org/10.1088/2632-2153/acdb30; molecular representation learning; drug discovery; molecular property prediction; natural language processing; transformers
spellingShingle	Atakan Yüksel Erva Ulusoy Atabey Ünlü Tunca Doğan SELFormer: molecular representation learning via SELFIES language models Machine Learning: Science and Technology molecular representation learning drug discovery molecular property prediction natural language processing transformers
title	SELFormer: molecular representation learning via SELFIES language models
title_full	SELFormer: molecular representation learning via SELFIES language models
title_fullStr	SELFormer: molecular representation learning via SELFIES language models
title_full_unstemmed	SELFormer: molecular representation learning via SELFIES language models
title_short	SELFormer: molecular representation learning via SELFIES language models
title_sort	selformer molecular representation learning via selfies language models
topic	molecular representation learning drug discovery molecular property prediction natural language processing transformers
url	https://doi.org/10.1088/2632-2153/acdb30
work_keys_str_mv	AT atakanyuksel selformermolecularrepresentationlearningviaselfieslanguagemodels AT ervaulusoy selformermolecularrepresentationlearningviaselfieslanguagemodels AT atabeyunlu selformermolecularrepresentationlearningviaselfieslanguagemodels AT tuncadogan selformermolecularrepresentationlearningviaselfieslanguagemodels

Similar Items