Automated BigSMILES conversion workflow and dataset for homopolymeric macromolecules

Abstract: The simplified molecular-input line-entry system (SMILES) has been utilized in a variety of artificial intelligence analyses owing to its capability of representing chemical structures using line notation. However, its ease of representation is limited, which has led to the proposal of BigS...

Bibliographic Details
Main Authors: Sunho Choi, Joonbum Lee, Jangwon Seo, Sung Won Han, Sang Hyun Lee, Ji-Hun Seo, Junhee Seok
Format: Article
Language: English
Published: Nature Portfolio 2024-04-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-024-03212-4
Collection: DOAJ
Description: The simplified molecular-input line-entry system (SMILES) has been utilized in a variety of artificial intelligence analyses owing to its capability of representing chemical structures using line notation. However, its ease of representation is limited, which has led to the proposal of BigSMILES as an alternative method suitable for the representation of macromolecules. Nevertheless, research on BigSMILES remains limited due to its preprocessing requirements. Thus, this study proposes a conversion workflow for BigSMILES, focusing on its automated generation from SMILES representations of homopolymers. BigSMILES representations for 4,927,181 records are provided, enabling their immediate use in various research and development applications. The study gives a detailed description of the validation process used to ensure the accuracy, interchangeability, and robustness of the conversion. Additionally, a systematic overview of the code and functions used, emphasizing their relevance to BigSMILES generation, is provided. This advancement is anticipated to significantly aid researchers and facilitate further studies in BigSMILES representation, including potential applications in deep learning and further extension to complex structures such as copolymers.
First indexed: 2024-04-24T09:56:39Z
Last indexed: 2024-04-24T09:56:39Z
Record ID: doaj.art-63fe68448051491a86d6dd1b2e2dc9b8
Institution: Directory of Open Access Journals
ISSN: 2052-4463
Record timestamp: 2024-04-14T11:07:22Z
Citation: Scientific Data, Vol 11, Iss 1, Pp 1-19 (2024)
Author affiliations:
Sunho Choi: School of Electrical Engineering, Korea University
Joonbum Lee: Department of Materials Science and Engineering, Korea University
Jangwon Seo: School of Electrical Engineering, Korea University
Sung Won Han: School of Industrial Management Engineering, Korea University
Sang Hyun Lee: School of Electrical Engineering, Korea University
Ji-Hun Seo: Department of Materials Science and Engineering, Korea University
Junhee Seok: School of Electrical Engineering, Korea University