Document Retrieval System for Biomedical Question Answering
In this paper, we describe our biomedical document retrieval system and answer extraction module, which are part of a biomedical question answering system. Approximately 26.5 million PubMed articles are indexed as a corpus with the Apache Lucene text search engine. Our proposed system consists of...
Main Authors: | Harun Bolat, Baha Şen |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2024-03-01 |
Series: | Applied Sciences |
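Subjects: | information retrieval; document retrieval; biomedical question answering; search engine; natural language processing |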
Online Access: | https://www.mdpi.com/2076-3417/14/6/2613 |
_version_ | 1797242160575152128 |
---|---|
author | Harun Bolat; Baha Şen |
author_facet | Harun Bolat; Baha Şen |
author_sort | Harun Bolat |
collection | DOAJ |
description | In this paper, we describe our biomedical document retrieval system and answer extraction module, which are part of a biomedical question answering system. Approximately 26.5 million PubMed articles are indexed as a corpus with the Apache Lucene text search engine. Our proposed system consists of three parts. The first part is the question analysis module, which analyzes the question and enriches it with biomedical concepts related to its wording. The second part is the document retrieval module, in which the proposed system is tested using different information retrieval models, such as the Vector Space Model, Okapi BM25, and Query Likelihood. The third part is the document re-ranking module, which is responsible for re-ordering the documents retrieved in the previous step. For this study, we tested our proposed system with 6B training questions from the BioASQ challenge task. We obtained the best MAP score in the document retrieval phase when we used Query Likelihood with Dirichlet smoothing. We used the sequential dependence model in the re-ranking phase, but this model produced a lower MAP score than the retrieval phase. In the similarity calculation, we included the Named Entity Recognition (NER), UMLS Concept Unique Identifiers (CUI), and UMLS Semantic Types of the words in the question to find the sentences containing the answer. Using this approach, we observed a performance improvement of roughly 25% for the top 20 results over another method employed in this study, which relies solely on textual similarity. |
first_indexed | 2024-04-24T18:34:48Z |
format | Article |
id | doaj.art-c299ee39aef445e8ad8cfd00cb7fff49 |
institution | Directory of Open Access Journals |
issn | 2076-3417 |
language | English |
last_indexed | 2024-04-24T18:34:48Z |
publishDate | 2024-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-c299ee39aef445e8ad8cfd00cb7fff49; 2024-03-27T13:20:15Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2024-03-01; vol. 14, no. 6, art. 2613; doi:10.3390/app14062613; Document Retrieval System for Biomedical Question Answering; Harun Bolat (Computer Engineering Department, Ankara Yıldırım Beyazıt University, 06010 Ankara, Turkey); Baha Şen (Computer Engineering Department, Ankara Yıldırım Beyazıt University, 06010 Ankara, Turkey); https://www.mdpi.com/2076-3417/14/6/2613; information retrieval; document retrieval; biomedical question answering; search engine; natural language processing |
title | Document Retrieval System for Biomedical Question Answering |
topic | information retrieval; document retrieval; biomedical question answering; search engine; natural language processing |
url | https://www.mdpi.com/2076-3417/14/6/2613 |
work_keys_str_mv | AT harunbolat documentretrievalsystemforbiomedicalquestionanswering AT bahasen documentretrievalsystemforbiomedicalquestionanswering |
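
The description above reports that Query Likelihood with Dirichlet smoothing gave the best MAP score in the retrieval phase. The following is a minimal Python sketch of those two ideas, not the authors' Lucene-based implementation. The function names, the toy documents, and the value of the smoothing parameter `mu` are assumptions made purely for illustration; the record does not state the settings used in the paper.

```python
import math
from collections import Counter


def dirichlet_ql_score(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """Log query likelihood of a document under Dirichlet smoothing.

    Each query term contributes log((tf + mu * P(w|C)) / (|d| + mu)),
    where P(w|C) is the term's relative frequency in the whole collection.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        cf = collection_tf.get(term, 0)
        if cf == 0:
            continue  # term never occurs in the collection: skip it (assumed convention)
        p_coll = cf / collection_len
        score += math.log((doc_tf[term] + mu * p_coll) / (doc_len + mu))
    return score


def rank_documents(query_terms, docs, mu=2000.0):
    """Return document ids sorted by descending query likelihood score."""
    collection_tf = Counter(t for terms in docs.values() for t in terms)
    collection_len = sum(collection_tf.values())
    scores = {
        doc_id: dirichlet_ql_score(query_terms, terms, collection_tf, collection_len, mu)
        for doc_id, terms in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query; MAP is the mean of this over all queries."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


if __name__ == "__main__":
    # Toy, made-up documents standing in for tokenized PubMed abstracts.
    docs = {
        "d1": ["aspirin", "reduces", "fever"],
        "d2": ["insulin", "regulates", "glucose"],
        "d3": ["aspirin", "aspirin", "pain"],
    }
    ranking = rank_documents(["aspirin"], docs, mu=10.0)
    print(ranking)
    print(average_precision(ranking, {"d1", "d3"}))
```

A small `mu` is used in the usage example only because the toy documents are three tokens long; with realistic document lengths a larger value is more typical, and the right choice would need to be tuned on held-out questions.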