Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach

Abstract Motivation Application of chemical named entity recognition (CNER) algorithms allows retrieval of information from texts about chemical compound identifiers and creates associations with physical–chemical properties and biological activities. Scientific texts represent low-formalized source...


Bibliographic Details
Main Authors: O. A. Tarasova, A. V. Rudik, N. Yu. Biziukova, D. A. Filimonov, V. V. Poroikov
Format: Article
Language: English
Published: BMC 2022-08-01
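Series: Journal of Cheminformatics
Subjects: Chemical named entity recognition, CNE, CNER, Naïve Bayes classifier, SARS-CoV-2, Mpro inhibitors
Online Access: https://doi.org/10.1186/s13321-022-00633-4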
_version_ 1811215645685055488
author O. A. Tarasova
A. V. Rudik
N. Yu. Biziukova
D. A. Filimonov
V. V. Poroikov
author_sort O. A. Tarasova
collection DOAJ
description Abstract
Motivation: Application of chemical named entity recognition (CNER) algorithms allows retrieval of information about chemical compound identifiers from texts and creates associations with physical–chemical properties and biological activities. Scientific texts represent low-formalized sources of information. Most methods aimed at CNER are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either vector or sparse word representation of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, and the datasets used for training are highly imbalanced.
Methods and results: We propose a new method for extracting CNEs from texts based on the naïve Bayes classifier combined with specially developed filters. In contrast to earlier CNER methods, our approach represents the data as a set of fragments of text (FoTs) and then prepares a set of multi-n-grams (sequences of one to n symbols) for each FoT. Our approach may enable the recognition of novel CNEs. For the CHEMDNER corpus, sensitivity (recall) was 0.95, precision was 0.74, specificity was 0.88, and balanced accuracy was 0.92, based on five-fold cross-validation. We applied the developed algorithm to the extracted CNEs of potential severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro) inhibitors. A set of CNEs corresponding to the chemical substances evaluated in the biochemical assays used for the discovery of Mpro inhibitors was retrieved. Manual analysis of the appropriate texts showed that CNEs of potential SARS-CoV-2 Mpro inhibitors were successfully identified by our method.
Conclusion: The obtained results show that the proposed method can be used for filtering out words that are not related to CNEs; therefore, it can be successfully applied to the extraction of CNEs for the purposes of cheminformatics and medicinal chemistry.
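The abstract describes two ideas that a short example may make concrete: building multi-n-grams (all symbol sequences of length one to n) from a fragment of text, and scoring those n-grams with a naïve Bayes classifier. The Python sketch below is only an illustration under stated assumptions and is not the authors' implementation. The function and class names, the choice of n = 3, the Laplace smoothing constant, and the toy training words are all hypothetical, and the "specially developed filters" mentioned in the abstract are not modeled.

```python
import math
from collections import Counter


def multi_n_grams(fragment, n_max=3):
    """Return every character sequence of length 1..n_max found in `fragment`."""
    return [
        fragment[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(fragment) - n + 1)
    ]


class NGramNaiveBayes:
    """Toy multinomial naive Bayes over multi-n-gram counts (illustrative only)."""

    def __init__(self, n_max=3, alpha=1.0):
        self.n_max = n_max      # assumed maximum n-gram length
        self.alpha = alpha      # assumed Laplace smoothing constant
        self.counts = {}        # label -> Counter of n-gram occurrences
        self.docs = Counter()   # label -> number of training fragments
        self.vocab = set()      # all n-grams seen during training

    def fit(self, fragments, labels):
        for fragment, label in zip(fragments, labels):
            grams = multi_n_grams(fragment.lower(), self.n_max)
            self.counts.setdefault(label, Counter()).update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def log_score(self, fragment, label):
        """Log prior plus the summed smoothed log likelihoods of the n-grams."""
        total = sum(self.counts[label].values())
        denom = total + self.alpha * len(self.vocab)
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for gram in multi_n_grams(fragment.lower(), self.n_max):
            score += math.log((self.counts[label][gram] + self.alpha) / denom)
        return score

    def predict(self, fragment):
        """Return the label with the highest log score."""
        return max(self.counts, key=lambda label: self.log_score(fragment, label))


def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy is the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Hypothetical toy data for illustration; not taken from the CHEMDNER corpus.
train = ["ibuprofen", "acetaminophen", "methanol", "ethanol",
         "protein", "results", "method", "analysis"]
labels = ["CNE", "CNE", "CNE", "CNE", "other", "other", "other", "other"]
model = NGramNaiveBayes(n_max=3).fit(train, labels)

print(multi_n_grams("abc", 3))
print(model.predict("propanol"))
print(balanced_accuracy(0.95, 0.88))
```

Because the features are character sequences rather than whole words, a word absent from the training set can still be scored from its shared n-grams, which is one plausible reading of why the approach may recognize novel CNEs. The last line also checks the reported figures: the mean of sensitivity 0.95 and specificity 0.88 is 0.915, consistent with the stated balanced accuracy of 0.92 after rounding.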
first_indexed 2024-04-12T06:26:06Z
format Article
id doaj.art-86da2f39a6de4e618a8d5f12e47d3a50
institution Directory Open Access Journal
issn 1758-2946
language English
last_indexed 2024-04-12T06:26:06Z
publishDate 2022-08-01
publisher BMC
record_format Article
series Journal of Cheminformatics
spelling doaj.art-86da2f39a6de4e618a8d5f12e47d3a50
2022-12-22T03:44:10Z
eng
BMC
Journal of Cheminformatics
1758-2946
2022-08-01
14 1 1 12
10.1186/s13321-022-00633-4
Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach
O. A. Tarasova 0
A. V. Rudik 1
N. Yu. Biziukova 2
D. A. Filimonov 3
V. V. Poroikov 4
Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry
Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry
Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry
Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry
Laboratory of Structure-Function Based Drug Design, Institute of Biomedical Chemistry
Abstract
Motivation: Application of chemical named entity recognition (CNER) algorithms allows retrieval of information about chemical compound identifiers from texts and creates associations with physical–chemical properties and biological activities. Scientific texts represent low-formalized sources of information. Most methods aimed at CNER are based on machine learning approaches, including conditional random fields and deep neural networks. In general, most machine learning approaches require either vector or sparse word representation of texts. Chemical named entities (CNEs) constitute only a small fraction of the whole text, and the datasets used for training are highly imbalanced.
Methods and results: We propose a new method for extracting CNEs from texts based on the naïve Bayes classifier combined with specially developed filters. In contrast to earlier CNER methods, our approach represents the data as a set of fragments of text (FoTs) and then prepares a set of multi-n-grams (sequences of one to n symbols) for each FoT. Our approach may enable the recognition of novel CNEs. For the CHEMDNER corpus, sensitivity (recall) was 0.95, precision was 0.74, specificity was 0.88, and balanced accuracy was 0.92, based on five-fold cross-validation. We applied the developed algorithm to the extracted CNEs of potential severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease (Mpro) inhibitors. A set of CNEs corresponding to the chemical substances evaluated in the biochemical assays used for the discovery of Mpro inhibitors was retrieved. Manual analysis of the appropriate texts showed that CNEs of potential SARS-CoV-2 Mpro inhibitors were successfully identified by our method.
Conclusion: The obtained results show that the proposed method can be used for filtering out words that are not related to CNEs; therefore, it can be successfully applied to the extraction of CNEs for the purposes of cheminformatics and medicinal chemistry.
https://doi.org/10.1186/s13321-022-00633-4
Chemical named entity recognition
CNE
CNER
Naïve Bayes classifier
SARS-CoV-2
Mpro inhibitors
title Chemical named entity recognition in the texts of scientific publications using the naïve Bayes classifier approach
topic Chemical named entity recognition
CNE
CNER
Naïve Bayes classifier
SARS-CoV-2
Mpro inhibitors
url https://doi.org/10.1186/s13321-022-00633-4