Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies

Text analysis can help to identify named entities (NEs) of small molecules, proteins, and genes. Such data are very important for the analysis of molecular mechanisms of disease progression and development of new strategies for the treatment of various diseases and pathological conditions. The texts...

Bibliographic Details
Main Authors: Nadezhda Biziukova, Olga Tarasova, Sergey Ivanov, Vladimir Poroikov
Format: Article
Language: English
Published: Frontiers Media S.A. 2020-12-01
Series: Frontiers in Genetics
Subjects: text mining; data mining; named entity recognition; NER; virus-host interactions; HIV
Online Access: https://www.frontiersin.org/articles/10.3389/fgene.2020.618862/full
_version_ 1818380582946603008
author Nadezhda Biziukova
Olga Tarasova
Sergey Ivanov
Sergey Ivanov
Vladimir Poroikov
author_facet Nadezhda Biziukova
Olga Tarasova
Sergey Ivanov
Sergey Ivanov
Vladimir Poroikov
author_sort Nadezhda Biziukova
collection DOAJ
description Text analysis can help to identify named entities (NEs) of small molecules, proteins, and genes. Such data are very important for the analysis of molecular mechanisms of disease progression and development of new strategies for the treatment of various diseases and pathological conditions. The texts of publications are a primary source of information, which is especially important for collecting data of the highest quality because information becomes available in them sooner than in databases. In our study, we aimed to develop and test an approach to named entity recognition in the abstracts of publications. More specifically, we developed and tested an algorithm based on conditional random fields that recognizes NEs of (i) genes and proteins and (ii) chemicals. Careful selection of abstracts strictly related to the subject of interest makes it possible to extract NEs strongly associated with that subject. To test the applicability of our approach, we applied it to the extraction of (i) potential HIV inhibitors and (ii) a set of proteins and genes potentially responsible for viremic control in HIV-positive patients. The computational experiments provide estimates of the accuracy of recognition of chemical NEs and of proteins (genes). The precision of chemical NE recognition is over 0.91, recall is 0.86, and the F1-score (harmonic mean of precision and recall) is 0.89; the precision of recognition of protein and gene names is over 0.86, recall is 0.83, and the F1-score is above 0.85. Evaluation of the algorithm on two case studies related to HIV treatment confirms our suggestion that it is possible to extract NEs strongly relevant to (i) HIV inhibitors and (ii) a group of patients, i.e., HIV-positive individuals able to maintain an undetectable HIV-1 viral load over time in the absence of antiretroviral therapy. Analysis of the results provides insights into the function of proteins that may be responsible for viremic control. Our study demonstrated the applicability of the developed approach to the extraction of useful data on HIV treatment.
first_indexed 2024-12-14T02:20:59Z
format Article
id doaj.art-1592b79550e3445388a0ce2982d11153
institution Directory Open Access Journal
issn 1664-8021
language English
last_indexed 2024-12-14T02:20:59Z
publishDate 2020-12-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Genetics
spelling doaj.art-1592b79550e3445388a0ce2982d11153
2022-12-21T23:20:30Z
eng
Frontiers Media S.A.
Frontiers in Genetics
1664-8021
2020-12-01
11
10.3389/fgene.2020.618862
618862
Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies
Nadezhda Biziukova [0]
Olga Tarasova [1]
Sergey Ivanov [2]
Sergey Ivanov [3]
Vladimir Poroikov [4]
[0] Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
[1] Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
[2] Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
[3] Department of Bioinformatics, Faculty of Biomedicine, Pirogov Russian National Research Medical University, Moscow, Russia
[4] Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
https://www.frontiersin.org/articles/10.3389/fgene.2020.618862/full
text mining
data mining
named entity recognition
NER
virus-host interactions
HIV
spellingShingle Nadezhda Biziukova
Olga Tarasova
Sergey Ivanov
Sergey Ivanov
Vladimir Poroikov
Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies
Frontiers in Genetics
text mining
data mining
named entity recognition
NER
virus-host interactions
HIV
title Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies
title_full Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies
title_fullStr Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies
title_full_unstemmed Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies
title_short Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies
title_sort automated extraction of information from texts of scientific publications insights into hiv treatment strategies
topic text mining
data mining
named entity recognition
NER
virus-host interactions
HIV
url https://www.frontiersin.org/articles/10.3389/fgene.2020.618862/full
work_keys_str_mv AT nadezhdabiziukova automatedextractionofinformationfromtextsofscientificpublicationsinsightsintohivtreatmentstrategies
AT olgatarasova automatedextractionofinformationfromtextsofscientificpublicationsinsightsintohivtreatmentstrategies
AT sergeyivanov automatedextractionofinformationfromtextsofscientificpublicationsinsightsintohivtreatmentstrategies
AT sergeyivanov automatedextractionofinformationfromtextsofscientificpublicationsinsightsintohivtreatmentstrategies
AT vladimirporoikov automatedextractionofinformationfromtextsofscientificpublicationsinsightsintohivtreatmentstrategies
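
The description defines the F1-score as the harmonic mean of precision and recall. The following minimal Python sketch illustrates that formula and one common way to obtain precision and recall for named entities, namely exact matching of predicted spans against gold-standard spans. The function names, the (start, end, type) span representation, and the exact-match criterion are assumptions made for illustration; the record does not state how the authors scored matches.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def entity_prf(gold: set, predicted: set) -> tuple:
    """Exact-match precision, recall and F1 over sets of entity spans.

    Each span is a hypothetical (start, end, type) tuple; a prediction
    counts as correct only if it matches a gold span exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Rounded values quoted in the description for chemical NEs.
print(round(f1_score(0.91, 0.86), 3))  # prints 0.884
```

With the rounded inputs 0.91 and 0.86 the formula gives about 0.884, slightly below the reported 0.89; this is presumably because the precision is stated only as "over 0.91", so the unrounded value is not available here.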