Seed-driven document ranking for systematic reviews in evidence-based medicine

Bibliographic Details
Main Author: Lee, Eunkyung
Other Authors: Sun Aixin
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access: https://hdl.handle.net/10356/146709
author Lee, Eunkyung
author2 Sun Aixin
collection NTU
description Evidence-based medicine (EBM) uses the best available evidence in patient care. Systematic reviews (SRs) are an essential part of EBM practice, providing the most up-to-date answers to a specific clinical question from primary evidence (published clinical studies). SRs locate, appraise, and synthesize the pertinent literature to derive answers. Finding all relevant documents is the most critical task in a systematic review. Missing relevant studies discredits the answers of an SR. To find every relevant study, SRs strictly follow four sequential steps: (i) defining a clinical question and relevance conditions, (ii) collecting potentially relevant documents from digital libraries (e.g., PubMed, Embase) by Boolean search, (iii) screening the retrieved documents (candidate documents) to identify relevant documents, and (iv) analyzing the relevant documents and deriving a conclusion (answer) from them. Each step is recorded and published as part of the SR. Screening (i.e., identifying relevant documents among candidate documents) is the most time-consuming step in conducting an SR. It is a manual process involving several thousand documents and thus typically takes a few months to complete. Many approaches have been proposed to improve screening using text mining and machine learning techniques. Among them, screening prioritization is the most promising for use in practice while still helping researchers identify all relevant documents. Given the candidate documents, it produces a ranking that aims to place relevant documents at the top. Screening documents in this prioritized order surfaces relevant documents early and leads to a more efficient SR workflow. SR experts naturally know of one or two relevant documents after defining the clinical question (i.e., before screening begins).
In this thesis, we propose a new approach to screening prioritization named seed-driven document ranking. We assume that one relevant document, which we call the ‘seed document’, is given and use it as a query. The task of seed-driven document ranking is to rank candidate documents using the seed document as the query. Existing approaches to screening prioritization focus on building short keyword queries from statistical information in the candidate documents. In this thesis, we extensively investigate ranking models for seed-driven document ranking. First, we develop a retrieval model adapted to a seed document query. We propose a ‘bag-of-clinical terms’ document representation that reduces a long document query to the essential clinical terms needed to find relevant documents. More importantly, we propose a weight function for the clinical terms in a seed document and combine it with the query likelihood retrieval model. Second, we propose a new semantic ranking model. We investigate a document matching (i.e., similarity) approach that ranks candidate documents by their matching scores against the seed document. We design position-based semantic term matching and develop a simple two-way document matching model. This semantic document matching approach further improves on the performance of the model proposed in the previous chapter. Third, we predict ranking performance on individual SRs in seed-driven document ranking. Despite state-of-the-art performance on benchmark datasets, the performance of ranking models varies widely across individual SRs. This large variation is undesirable because, during manual screening, SR experts experience the local performance (performance on a single SR) rather than the global performance (performance across multiple SRs in benchmark datasets). We hypothesize that the ranking difficulty of an individual SR depends on the broadness of its topic. To this end, we propose a measure that predicts the ranking difficulty of SRs.
Furthermore, we explore methods to improve both the local performance, especially on difficult SRs, and the global performance of ranking models. We first study PICO elements in medical literature. Automatic PICO recognition helps identify information essential to the relevance of documents. We examine the boundaries of PICO span annotations and discuss how to evaluate machine learning models on this task correctly and effectively. Next, we investigate embeddings of medical concepts (i.e., medical/clinical terms). Our work in previous chapters shows that medical concepts are important for representing documents effectively. We analyze the stability of medical concept embeddings. We conclude the thesis with a discussion of how other domain-specific applications that similarly aim for high recall, such as legal document search and test collection generation, can benefit from seed-driven document ranking.
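The retrieval idea in the description (a seed document reduced to clinical terms, term weights, and query likelihood scoring) can be illustrated with a minimal Python sketch. This is not the thesis's model: the whitespace tokenizer, the `clinical_vocab` set, the function name `rank_by_seed`, the Dirichlet smoothing parameter `mu`, and the weight (relative term frequency in the seed) are all assumptions made for illustration, since the record does not specify the actual weight function.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization.
    return text.lower().split()


def rank_by_seed(seed, candidates, clinical_vocab, mu=10.0):
    """Return candidate indices ordered by weighted query likelihood.

    Documents and the seed are reduced to terms found in clinical_vocab
    (a stand-in for a bag-of-clinical-terms representation).
    """
    docs = [Counter(t for t in tokenize(d) if t in clinical_vocab)
            for d in candidates]
    collection = Counter()
    for d in docs:
        collection.update(d)
    coll_len = sum(collection.values())

    seed_terms = Counter(t for t in tokenize(seed) if t in clinical_vocab)
    seed_len = sum(seed_terms.values())

    scores = []
    for idx, doc in enumerate(docs):
        doc_len = sum(doc.values())
        score = 0.0
        for term, tf in seed_terms.items():
            if collection[term] == 0:
                continue  # term never occurs in the candidates; skip it
            # Hypothetical weight: relative frequency of the term in the seed.
            weight = tf / seed_len
            # Dirichlet-smoothed probability of the term under the document.
            p = (doc[term] + mu * collection[term] / coll_len) / (doc_len + mu)
            score += weight * math.log(p)
        scores.append((score, idx))
    # Highest score first; ties keep the original candidate order.
    return [idx for _, idx in sorted(scores, key=lambda s: (-s[0], s[1]))]


vocab = {"aspirin", "stroke", "surgery", "knee", "rehabilitation"}
candidates = [
    "aspirin reduces stroke risk in adults",
    "knee surgery outcomes after injury",
    "stroke rehabilitation exercise program",
]
print(rank_by_seed("aspirin stroke prevention trial", candidates, vocab))
# [0, 2, 1]
```

In this toy run the candidate sharing both clinical terms with the seed ranks first, and the one sharing none ranks last. A real system would likely replace the vocabulary lookup with a clinical term extractor and the frequency-based weight with a learned or corpus-derived one.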
first_indexed 2024-10-01T07:53:14Z
format Thesis-Doctor of Philosophy
id ntu-10356/146709
institution Nanyang Technological University
language English
last_indexed 2024-10-01T07:53:14Z
publishDate 2021
publisher Nanyang Technological University
record_format dspace
spelling ntu-10356/146709
2021-04-20T07:00:35Z
Seed-driven document ranking for systematic reviews in evidence-based medicine
Lee, Eunkyung
Sun Aixin
School of Computer Science and Engineering
AXSun@ntu.edu.sg
Engineering::Computer science and engineering::Information systems::Information storage and retrieval
Doctor of Philosophy
2021-03-08T01:35:07Z
2021-03-08T01:35:07Z
2020
Thesis-Doctor of Philosophy
Lee, F. (2020). Seed-driven document ranking for systematic reviews in evidence-based medicine. Doctoral thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/146709
10.32657/10356/146709
en
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
application/pdf
Nanyang Technological University
title Seed-driven document ranking for systematic reviews in evidence-based medicine
topic Engineering::Computer science and engineering::Information systems::Information storage and retrieval
url https://hdl.handle.net/10356/146709