Linguistic measures of chemical diversity and the “keywords” of molecular collections

Bibliographic Details
Main Authors: Michał Woźniak, Agnieszka Wołos, Urszula Modrzyk, Rafał L. Górski, Jan Winkowski, Michał Bajczyk, Sara Szymkuć, Bartosz A. Grzybowski, Maciej Eder
Format: Article
Language: English
Published: Nature Portfolio 2018-05-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-018-25440-6
collection DOAJ
description Abstract: Computerized linguistic analyses have proven immensely valuable for comparing and searching large text collections (“corpora”), including those deposited on the Internet; indeed, it would be hard nowadays to imagine browsing the Web without, for instance, search algorithms that extract the most appropriate keywords from documents. This paper describes how such corpus-linguistic concepts can be extended to chemistry by means of characteristic “chemical words” that go beyond traditional functional groups and instead capture the common structural fragments that molecules share. Using these words, it is possible to quantify the diversity of chemical collections and databases in new ways and to define the molecular “keywords” by which such collections are best characterized and annotated.
first_indexed 2024-12-18T05:19:33Z
id doaj.art-cb3bb3ade27f4b13a2b20861629d946e
institution Directory Open Access Journal
issn 2045-2322
last_indexed 2024-12-18T05:19:33Z
spelling doaj.art-cb3bb3ade27f4b13a2b20861629d946e
2022-12-21T21:19:42Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2018-05-01
Volume 8, issue 1, pages 1-10
10.1038/s41598-018-25440-6
Linguistic measures of chemical diversity and the “keywords” of molecular collections
Michał Woźniak (Institute of Polish Language, Polish Academy of Sciences)
Agnieszka Wołos (Institute of Organic Chemistry, Polish Academy of Sciences)
Urszula Modrzyk (Institute of Polish Language, Polish Academy of Sciences)
Rafał L. Górski (Institute of Polish Language, Polish Academy of Sciences)
Jan Winkowski (Institute of Polish Language, Polish Academy of Sciences)
Michał Bajczyk (Institute of Organic Chemistry, Polish Academy of Sciences)
Sara Szymkuć (Institute of Organic Chemistry, Polish Academy of Sciences)
Bartosz A. Grzybowski (Institute of Organic Chemistry, Polish Academy of Sciences)
Maciej Eder (Institute of Polish Language, Polish Academy of Sciences)
https://doi.org/10.1038/s41598-018-25440-6
title Linguistic measures of chemical diversity and the “keywords” of molecular collections
url https://doi.org/10.1038/s41598-018-25440-6