Combining word embeddings to extract chemical and drug entities in biomedical literature

Abstract. Background: Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving the access and integration of information from unstructured data such as biomedical literature. Methods: In this paper we evaluate two...

Bibliographic Details
Main Authors:	Pilar López-Úbeda, Manuel Carlos Díaz-Galiano, L. Alfonso Ureña-López, M. Teresa Martín-Valdivia
Format:	Article
Language:	English
Published:	BMC 2021-12-01
Series:	BMC Bioinformatics
Subjects:	Natural language processing Named entity recognition Concept indexing Neural network Word embeddings SNOMED-CT
Online Access:	https://doi.org/10.1186/s12859-021-04188-3

_version_	1827590270354456576
author	Pilar López-Úbeda Manuel Carlos Díaz-Galiano L. Alfonso Ureña-López M. Teresa Martín-Valdivia
author_facet	Pilar López-Úbeda Manuel Carlos Díaz-Galiano L. Alfonso Ureña-López M. Teresa Martín-Valdivia
author_sort	Pilar López-Úbeda
collection	DOAJ
description	Abstract. Background: Natural language processing (NLP) and text mining technologies for the extraction and indexing of chemical and drug entities are key to improving the access and integration of information from unstructured data such as biomedical literature. Methods: In this paper we evaluate two important NLP tasks: named entity recognition (NER) and entity indexing using the SNOMED-CT terminology. For this purpose, we propose a combination of word embeddings in order to improve on the results obtained in the PharmaCoNER challenge. Results: For the NER task we present a neural network composed of a BiLSTM with a CRF sequential layer, where different word embeddings are combined as input to the architecture. A hybrid method combining supervised and unsupervised models is used for the concept indexing task. In the supervised model, we use the training set to find previously trained concepts, and the unsupervised model is based on a 6-step architecture. This architecture uses a dictionary of synonyms and the Levenshtein distance to assign the correct SNOMED-CT code. Conclusion: On the one hand, the combination of word embeddings helps to improve the recognition of chemicals and drugs in the biomedical literature. We achieved 91.41% precision, 90.14% recall, and 90.77% F1-score using micro-averaging. On the other hand, our indexing system achieves 92.91% precision, 92.44% recall, and a 92.67% F1-score. With these results, we would be in first position in the final ranking.
first_indexed	2024-03-09T01:14:31Z
format	Article
id	doaj.art-4baebaecfa3145d99217f629fa37915f
institution	Directory of Open Access Journals
issn	1471-2105
language	English
last_indexed	2024-03-09T01:14:31Z
publishDate	2021-12-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj.art-4baebaecfa3145d99217f629fa37915f 2023-12-10T12:33:49Z eng BMC BMC Bioinformatics 1471-2105 2021-12-01 22 S1 1 17 10.1186/s12859-021-04188-3 Combining word embeddings to extract chemical and drug entities in biomedical literature Pilar López-Úbeda Manuel Carlos Díaz-Galiano L. Alfonso Ureña-López M. Teresa Martín-Valdivia Department of Computer Science, Advanced Studies Center in Information and Communication Technologies (CEATIC), Universidad de Jaén
title	Combining word embeddings to extract chemical and drug entities in biomedical literature
topic	Natural language processing Named entity recognition Concept indexing Neural network Word embeddings SNOMED-CT
url	https://doi.org/10.1186/s12859-021-04188-3
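
The abstract names two ideas that are easy to illustrate: assigning a code by Levenshtein distance against a dictionary of synonyms, and the F1-score as the harmonic mean of precision and recall. The following is a minimal Python sketch of both, not the authors' 6-step architecture. The dictionary entries, the placeholder codes (`CODE-001`, `CODE-002`), the `max_distance` threshold, and all function names are assumptions made for illustration only; the codes are not real SNOMED-CT identifiers.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical synonym dictionary. The codes are placeholders,
# not real SNOMED-CT identifiers.
SYNONYMS = {
    "paracetamol": "CODE-001",
    "acetaminofen": "CODE-001",
    "ibuprofeno": "CODE-002",
}


def closest_code(mention, synonyms=SYNONYMS, max_distance=2):
    """Return the code of the nearest synonym, or None if none is close enough."""
    mention = mention.lower()
    best = min(synonyms, key=lambda term: levenshtein(mention, term))
    if levenshtein(mention, best) <= max_distance:
        return synonyms[best]
    return None


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(closest_code("ibuprofen"))
    print(round(f1_score(91.41, 90.14), 2))
    print(round(f1_score(92.91, 92.44), 2))
```

Applied to the precision and recall figures quoted in the abstract, `f1_score` gives values that round to the reported F1-scores (90.77 for NER and 92.67 for indexing), which is a quick consistency check on the numbers in the record.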

Similar Items