Document Retrieval System for Biomedical Question Answering

In this paper, we describe our biomedical document retrieval system and answer extraction module, which are part of a biomedical question answering system. Approximately 26.5 million PubMed articles are indexed as a corpus with the Apache Lucene text search engine. The proposed system consists of three parts. The first part is the question analysis module, which analyzes the question and enriches it with related biomedical concepts. The second part is the document retrieval module, in which the system is tested with different information retrieval models, such as the Vector Space Model, Okapi BM25, and Query Likelihood. The third part is the document re-ranking module, which re-arranges the documents retrieved in the previous step. We evaluated the proposed system on the training questions of BioASQ challenge Task 6B. The best MAP score in the document retrieval phase was obtained with the Query Likelihood model with Dirichlet smoothing. We used the sequential dependence model in the re-ranking phase, but it produced a lower MAP score than the retrieval phase. In the similarity calculation for answer extraction, we included Named Entity Recognition (NER) matches, UMLS Concept Unique Identifiers (CUIs), and UMLS Semantic Types of the question words to find the sentences containing the answer. With this approach, we observed a performance improvement of roughly 25% for the top 20 results over the method in this study that relies solely on textual similarity.
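
The abstract does not spell out the ranking functions, but Query Likelihood with Dirichlet smoothing (the best-performing model in the document retrieval phase) is standard; for reference, it scores a document D against a query Q as

    \mathrm{score}(Q, D) \;=\; \sum_{w \in Q} \log \frac{c(w, D) + \mu \, P(w \mid C)}{|D| + \mu}

where c(w, D) is the count of term w in D, |D| is the document length, P(w | C) is the collection language model, and \mu is the Dirichlet prior (the value used by the authors is not stated in the abstract).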

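Apache Lucene ships with both of the ranking functions named above. A minimal sketch of how such an index could be queried is given below; the index path, the field name "abstract", and the Dirichlet prior mu = 2000 are placeholder assumptions for illustration, not values from the paper.

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.similarities.LMDirichletSimilarity;
    import org.apache.lucene.store.FSDirectory;

    public class PubMedSearchSketch {
        public static void main(String[] args) throws Exception {
            // Open a previously built Lucene index of PubMed articles (path is a placeholder).
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("pubmed-index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);

                // Query Likelihood with Dirichlet smoothing; mu = 2000 is Lucene's default,
                // not a value reported by the authors. Swapping in new BM25Similarity()
                // would correspond to the Okapi BM25 run instead.
                searcher.setSimilarity(new LMDirichletSimilarity(2000f));

                // The field name "abstract" is an assumption about the index schema.
                QueryParser parser = new QueryParser("abstract", new StandardAnalyzer());
                Query query = parser.parse("BRCA1 breast cancer risk");

                // Top 20 documents, matching the evaluation cut-off mentioned in the abstract.
                ScoreDoc[] hits = searcher.search(query, 20).scoreDocs;
                for (ScoreDoc hit : hits) {
                    // Stored fields (e.g., the PMID) could be loaded here; scores are printed only.
                    System.out.println("doc=" + hit.doc + "  score=" + hit.score);
                }
            }
        }
    }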

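The answer extraction step combines textual similarity with matches on named entities, UMLS CUIs, and UMLS Semantic Types. A minimal illustrative sketch of such a combined sentence score follows; the use of Jaccard overlap and the weights are assumptions for illustration, not the weighting reported in the paper.

    import java.util.HashSet;
    import java.util.Set;

    public class CombinedSimilaritySketch {

        // Jaccard overlap between two sets; returns 0 when both are empty.
        static double jaccard(Set<String> a, Set<String> b) {
            if (a.isEmpty() && b.isEmpty()) return 0.0;
            Set<String> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return (double) inter.size() / union.size();
        }

        // Combine plain token similarity with overlap of named entities, UMLS CUIs,
        // and UMLS Semantic Types extracted from the question and a candidate sentence.
        // The weights are illustrative placeholders, not values from the paper.
        static double combinedScore(Set<String> qTokens, Set<String> sTokens,
                                    Set<String> qEntities, Set<String> sEntities,
                                    Set<String> qCuis, Set<String> sCuis,
                                    Set<String> qTypes, Set<String> sTypes) {
            return 0.4 * jaccard(qTokens, sTokens)
                 + 0.3 * jaccard(qEntities, sEntities)
                 + 0.2 * jaccard(qCuis, sCuis)
                 + 0.1 * jaccard(qTypes, sTypes);
        }
    }
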
Bibliographic Details
Main Authors: Harun Bolat, Baha Şen (Computer Engineering Department, Ankara Yıldırım Beyazıt University, 06010 Ankara, Turkey)
Format: Article
Language: English
Published: MDPI AG, 2024-03-01
Series: Applied Sciences
ISSN: 2076-3417
DOI: 10.3390/app14062613
Subjects: information retrieval; document retrieval; biomedical question answering; search engine; natural language processing
Online Access: https://www.mdpi.com/2076-3417/14/6/2613