Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization
Abstract Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that the traditional SMILES tokenization has a certain limitation that results in tokens failing to reflect the true nature of mol...
Main Authors: | Umit V. Ucak, Islambek Ashyrmamatov, Juyong Lee |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2023-05-01 |
Series: | Journal of Cheminformatics |
Subjects: | Atom-in-SMILES, Tokenization, Repetition, Chemical language processing |
Online Access: | https://doi.org/10.1186/s13321-023-00725-9 |
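
The abstract's point about conventional SMILES tokenization can be illustrated with a short sketch. The regular expression below is a commonly used atom-wise SMILES pattern and is an assumption here, not something stated in this record; the function name `tokenize_smiles` is likewise hypothetical. The sketch does not implement atom-in-SMILES itself. It only shows that an atom-wise tokenizer assigns the same token to atoms in different chemical environments.

```python
import re
from collections import Counter
from typing import List

# Commonly used atom-wise SMILES pattern (an assumption; not taken from this record).
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into atom-wise tokens (hypothetical helper)."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Reject input containing characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES string: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Acetic acid: the methyl carbon and the carboxyl carbon both become "C",
    # and the carbonyl and hydroxyl oxygens both become "O".
    tokens = tokenize_smiles("CC(=O)O")
    print(tokens)
    print(Counter(tokens))
```

In acetic acid (`CC(=O)O`), the two carbons and the two oxygens each share a single token even though their bonding environments differ, which is the kind of ambiguity the abstract attributes to generic SMILES tokens.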
_version_ | 1797752662853156864 |
---|---|
author | Umit V. Ucak; Islambek Ashyrmamatov; Juyong Lee |
author_facet | Umit V. Ucak; Islambek Ashyrmamatov; Juyong Lee |
author_sort | Umit V. Ucak |
collection | DOAJ |
description | Abstract Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that the traditional SMILES tokenization has a certain limitation that results in tokens failing to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme, which eliminates ambiguities in the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degrees of token degeneration of various schemes and analyzed their adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were incorporated for qualitative examination. We believe that the atom-in-SMILES tokenization has great potential to be adopted by a broad range of related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models. |
first_indexed | 2024-03-12T17:07:43Z |
format | Article |
id | doaj.art-f3ba08aa0cb047ad994f7c66938c686b |
institution | Directory Open Access Journal |
issn | 1758-2946 |
language | English |
last_indexed | 2024-03-12T17:07:43Z |
publishDate | 2023-05-01 |
publisher | BMC |
record_format | Article |
series | Journal of Cheminformatics |
spelling | doaj.art-f3ba08aa0cb047ad994f7c66938c686b; 2023-08-06T11:23:25Z; eng; BMC; Journal of Cheminformatics; 1758-2946; 2023-05-01; 151113; 10.1186/s13321-023-00725-9; Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization; Umit V. Ucak (0); Islambek Ashyrmamatov (1); Juyong Lee (2); Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University; College of Pharmacy, Seoul National University; Research Institute of Pharmaceutical Science, Seoul National University; https://doi.org/10.1186/s13321-023-00725-9; Atom-in-SMILES; Tokenization; Repetition; Chemical language processing |
spellingShingle | Umit V. Ucak; Islambek Ashyrmamatov; Juyong Lee; Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization; Journal of Cheminformatics; Atom-in-SMILES; Tokenization; Repetition; Chemical language processing |
title | Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization |
title_sort | improving the quality of chemical language model outcomes with atom in smiles tokenization |
topic | Atom-in-SMILES; Tokenization; Repetition; Chemical language processing |
url | https://doi.org/10.1186/s13321-023-00725-9 |