Combining word embeddings to extract chemical and drug entities in biomedical literature

Abstract. Background: Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving the access and integration of information from unstructured data such as biomedical literature. Methods: In this paper we evaluate two...


Bibliographic Details
Main Authors: Pilar López-Úbeda, Manuel Carlos Díaz-Galiano, L. Alfonso Ureña-López, M. Teresa Martín-Valdivia
Format: Article
Language: English
Published: BMC 2021-12-01
Series: BMC Bioinformatics
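Subjects: Natural language processing; Named entity recognition; Concept indexing; Neural network; Word embeddings; SNOMED-CT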
Online Access: https://doi.org/10.1186/s12859-021-04188-3
_version_ 1827590270354456576
author Pilar López-Úbeda
Manuel Carlos Díaz-Galiano
L. Alfonso Ureña-López
M. Teresa Martín-Valdivia
author_facet Pilar López-Úbeda
Manuel Carlos Díaz-Galiano
L. Alfonso Ureña-López
M. Teresa Martín-Valdivia
author_sort Pilar López-Úbeda
collection DOAJ
description Abstract. Background: Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving the access and integration of information from unstructured data such as biomedical literature. Methods: In this paper we evaluate two important NLP tasks: named entity recognition (NER) and entity indexing using the SNOMED-CT terminology. For this purpose, we propose a combination of word embeddings to improve the results obtained in the PharmaCoNER challenge. Results: For the NER task we present a neural network composed of a BiLSTM with a CRF sequential layer, where different word embeddings are combined as the input to the architecture. A hybrid method combining supervised and unsupervised models is used for the concept indexing task. In the supervised model, we use the training set to find previously trained concepts, and the unsupervised model is based on a 6-step architecture. This architecture uses a dictionary of synonyms and the Levenshtein distance to assign the correct SNOMED-CT code. Conclusion: On the one hand, the combination of word embeddings helps to improve the recognition of chemicals and drugs in the biomedical literature. We achieved 91.41% precision, 90.14% recall, and 90.77% F1-score using micro-averaging. On the other hand, our indexing system achieves 92.91% precision, 92.44% recall, and a 92.67% F1-score. With these results, we would take first position in the final ranking.
first_indexed 2024-03-09T01:14:31Z
format Article
id doaj.art-4baebaecfa3145d99217f629fa37915f
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-03-09T01:14:31Z
publishDate 2021-12-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-4baebaecfa3145d99217f629fa37915f
2023-12-10T12:33:49Z
eng
BMC
BMC Bioinformatics
1471-2105
2021-12-01
22
S1
1
17
10.1186/s12859-021-04188-3
Combining word embeddings to extract chemical and drug entities in biomedical literature
Pilar López-Úbeda 0
Manuel Carlos Díaz-Galiano 1
L. Alfonso Ureña-López 2
M. Teresa Martín-Valdivia 3
Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén
Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén
Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén
Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén
https://doi.org/10.1186/s12859-021-04188-3
Natural language processing
Named entity recognition
Concept indexing
Neural network
Word embeddings
SNOMED-CT
spellingShingle Pilar López-Úbeda
Manuel Carlos Díaz-Galiano
L. Alfonso Ureña-López
M. Teresa Martín-Valdivia
Combining word embeddings to extract chemical and drug entities in biomedical literature
BMC Bioinformatics
Natural language processing
Named entity recognition
Concept indexing
Neural network
Word embeddings
SNOMED-CT
title Combining word embeddings to extract chemical and drug entities in biomedical literature
title_full Combining word embeddings to extract chemical and drug entities in biomedical literature
title_fullStr Combining word embeddings to extract chemical and drug entities in biomedical literature
title_full_unstemmed Combining word embeddings to extract chemical and drug entities in biomedical literature
title_short Combining word embeddings to extract chemical and drug entities in biomedical literature
title_sort combining word embeddings to extract chemical and drug entities in biomedical literature
topic Natural language processing
Named entity recognition
Concept indexing
Neural network
Word embeddings
SNOMED-CT
url https://doi.org/10.1186/s12859-021-04188-3
work_keys_str_mv AT pilarlopezubeda combiningwordembeddingstoextractchemicalanddrugentitiesinbiomedicalliterature
AT manuelcarlosdiazgaliano combiningwordembeddingstoextractchemicalanddrugentitiesinbiomedicalliterature
AT lalfonsourenalopez combiningwordembeddingstoextractchemicalanddrugentitiesinbiomedicalliterature
AT mteresamartinvaldivia combiningwordembeddingstoextractchemicalanddrugentitiesinbiomedicalliterature
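
Illustrative note: the abstract in this record says the indexing step uses a dictionary of synonyms and the Levenshtein distance to assign a SNOMED-CT code. The record does not describe the six steps themselves, so the following is only a minimal sketch of the general idea, not the authors' implementation. The function names, the `max_distance` threshold, and the toy dictionary are assumptions made for illustration, and the codes shown are placeholders rather than real SNOMED-CT identifiers.

```python
from typing import Dict, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def closest_code(
    mention: str,
    dictionary: Dict[str, str],
    max_distance: int = 2,  # assumed threshold, not taken from the paper
) -> Optional[Tuple[str, str, int]]:
    """Return (code, matched term, distance) for the nearest term, or None."""
    mention = mention.lower().strip()
    best: Optional[Tuple[str, str, int]] = None
    for term, code in dictionary.items():
        distance = levenshtein(mention, term.lower())
        if distance <= max_distance and (best is None or distance < best[2]):
            best = (code, term, distance)
    return best


# Hypothetical synonym dictionary: synonyms share one placeholder code.
TOY_DICTIONARY = {
    "paracetamol": "CODE-A",
    "acetaminophen": "CODE-A",
    "ibuprofen": "CODE-B",
}

if __name__ == "__main__":
    for mention in ("Paracetamol", "ibuprofeno", "aspirin"):
        print(mention, "->", closest_code(mention, TOY_DICTIONARY))
```

In this sketch an exact or near match returns the code of the closest dictionary term, while a mention with no term within the threshold returns `None`, which a fuller pipeline would pass on to a later fallback step.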