Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization

Abstract Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that the traditional SMILES tokenization has a certain limitation that results in tokens failing to reflect the true nature of mol...


Bibliographic Details
Main Authors: Umit V. Ucak, Islambek Ashyrmamatov, Juyong Lee
Format: Article
Language: English
Published: BMC 2023-05-01
Series: Journal of Cheminformatics
Subjects: Atom-in-SMILES, Tokenization, Repetition, Chemical language processing
Online Access: https://doi.org/10.1186/s13321-023-00725-9
collection DOAJ
description Abstract Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that traditional SMILES tokenization has a limitation that results in tokens failing to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme, which eliminates ambiguities in the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degrees of token degeneration of various schemes and analyzed their adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were incorporated for qualitative examination. We believe that atom-in-SMILES tokenization has great potential to be adopted by a broad range of related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.
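The limitation described in the abstract can be illustrated with a minimal sketch. It assumes a commonly used atom-wise regular expression for splitting SMILES strings; this record does not say which baseline tokenizer the article used, and the sketch is not the atom-in-SMILES scheme itself. The function names `tokenize_smiles` and `max_consecutive_repeat` are hypothetical, and the run-length count is only one simple way to look at token-level repetition, not necessarily the measure used in the article.

```python
import re

# Atom-wise SMILES pattern of the kind widely used in chemical language
# models (an assumption for illustration, not taken from this record).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into atom-wise tokens (hypothetical helper)."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Refuse input containing characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES: {smiles!r}")
    return tokens


def max_consecutive_repeat(tokens):
    """Length of the longest run of identical adjacent tokens."""
    longest = 0
    run = 0
    previous = None
    for token in tokens:
        run = run + 1 if token == previous else 1
        longest = max(longest, run)
        previous = token
    return longest


# Acetic acid: the methyl carbon and the carboxyl carbon both become "C",
# so the token alone says nothing about each atom's chemical environment.
acetic_acid_tokens = tokenize_smiles("CC(=O)O")

# Benzene: five identical aromatic-carbon tokens appear in a row.
benzene_tokens = tokenize_smiles("c1ccccc1")
benzene_repeat = max_consecutive_repeat(benzene_tokens)
```

In this sketch both carbons of acetic acid map to the same generic token, which is the kind of ambiguity the abstract says atom-in-SMILES tokens are designed to remove.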
first_indexed 2024-03-12T17:07:43Z
format Article
id doaj.art-f3ba08aa0cb047ad994f7c66938c686b
institution Directory Open Access Journal
issn 1758-2946
language English
last_indexed 2024-03-12T17:07:43Z
publishDate 2023-05-01
publisher BMC
record_format Article
series Journal of Cheminformatics
spelling doaj.art-f3ba08aa0cb047ad994f7c66938c686b
2023-08-06T11:23:25Z
eng
BMC
Journal of Cheminformatics
1758-2946
2023-05-01
15
1
1
13
10.1186/s13321-023-00725-9
Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization
Umit V. Ucak (Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University)
Islambek Ashyrmamatov (College of Pharmacy, Seoul National University)
Juyong Lee (Research Institute of Pharmaceutical Science, Seoul National University)
https://doi.org/10.1186/s13321-023-00725-9
Atom-in-SMILES
Tokenization
Repetition
Chemical language processing
title Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization
topic Atom-in-SMILES
Tokenization
Repetition
Chemical language processing
url https://doi.org/10.1186/s13321-023-00725-9