Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

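Abstract: Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that the traditional SMILES tokenization has a certain limitation that results in tokens failing to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme that eliminates ambiguities in the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degrees of token degeneration of various schemes and analyzed their adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were incorporated for qualitative examination. We believe that the atom-in-SMILES tokenization has great potential to be adopted by a broad range of related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.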

Bibliographic Details
Main Authors:	Umit V. Ucak, Islambek Ashyrmamatov, Juyong Lee
Format:	Article
Language:	English
Published:	BMC 2023-05-01
Series:	Journal of Cheminformatics
Subjects:	Atom-in-SMILES; Tokenization; Repetition; Chemical language processing
Online Access:	https://doi.org/10.1186/s13321-023-00725-9

_version_	1797752662853156864
author	Umit V. Ucak; Islambek Ashyrmamatov; Juyong Lee
author_facet	Umit V. Ucak; Islambek Ashyrmamatov; Juyong Lee
author_sort	Umit V. Ucak
collection	DOAJ
description	Abstract: Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that the traditional SMILES tokenization has a certain limitation that results in tokens failing to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme that eliminates ambiguities in the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degrees of token degeneration of various schemes and analyzed their adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were incorporated for qualitative examination. We believe that the atom-in-SMILES tokenization has great potential to be adopted by a broad range of related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.
first_indexed	2024-03-12T17:07:43Z
format	Article
id	doaj.art-f3ba08aa0cb047ad994f7c66938c686b
institution	Directory Open Access Journal
issn	1758-2946
language	English
last_indexed	2024-03-12T17:07:43Z
publishDate	2023-05-01
publisher	BMC
record_format	Article
series	Journal of Cheminformatics
spelling	doaj.art-f3ba08aa0cb047ad994f7c66938c686b | 2023-08-06T11:23:25Z | eng | BMC | Journal of Cheminformatics | 1758-2946 | 2023-05-01 | 151113 | 10.1186/s13321-023-00725-9 | Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization | Umit V. Ucak 0; Islambek Ashyrmamatov 1; Juyong Lee 2 | Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University; College of Pharmacy, Seoul National University; Research Institute of Pharmaceutical Science, Seoul National University | Abstract: Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that the traditional SMILES tokenization has a certain limitation that results in tokens failing to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme that eliminates ambiguities in the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degrees of token degeneration of various schemes and analyzed their adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were incorporated for qualitative examination. We believe that the atom-in-SMILES tokenization has great potential to be adopted by a broad range of related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models. | https://doi.org/10.1186/s13321-023-00725-9 | Atom-in-SMILES; Tokenization; Repetition; Chemical language processing
title	Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization
topic	Atom-in-SMILES; Tokenization; Repetition; Chemical language processing
url	https://doi.org/10.1186/s13321-023-00725-9

Similar Items