Seed-driven document ranking for systematic reviews in evidence-based medicine

Evidence-based medicine (EBM) uses the best available evidence in the process of patient care. Systematic reviews (SRs) are an essential part of the practice of EBM, providing the most up-to-date answers from primary evidence (published clinical studies) for a specific clinical question. SRs locate,...

Bibliographic Details
Main Author:	Lee, Eunkyung
Other Authors:	Sun Aixin
Format:	Thesis-Doctor of Philosophy
Language:	English
Published:	Nanyang Technological University 2021
Subjects:	Engineering::Computer science and engineering::Information systems::Information storage and retrieval
Online Access:	https://hdl.handle.net/10356/146709

_version_	1811697311030444032
author	Lee, Eunkyung
author2	Sun Aixin
author_facet	Sun Aixin Lee, Eunkyung
author_sort	Lee, Eunkyung
collection	NTU
description	Evidence-based medicine (EBM) uses the best available evidence in the process of patient care. Systematic reviews (SRs) are an essential part of the practice of EBM, providing the most up-to-date answers from primary evidence (published clinical studies) for a specific clinical question. SRs locate, appraise, and synthesize pertinent literature and derive answers. Finding all relevant documents is the most critical task in systematic reviews, because missing relevant studies discredits the answers an SR gives. To find all relevant literature, SRs are conducted by strictly following four sequential steps: (i) defining a clinical question and relevance conditions, (ii) collecting potentially relevant documents from digital libraries (e.g., PubMed, Embase) by Boolean search, (iii) screening the retrieved documents (candidate documents) to identify relevant documents, and (iv) analyzing the relevant documents and deriving a conclusion (answer) based on them. Each step is recorded and published together in the SR. Screening (i.e., identifying relevant documents among candidate documents) is the most time-consuming process when conducting SRs. It is a manual process involving several thousand documents, and thus typically takes a few months to complete. Many approaches have been proposed to improve the screening process using text mining and machine learning techniques. Among them, screening prioritization is the most promising approach to implement in practice, because it still lets researchers identify all relevant documents. Given the candidate documents, it produces a ranking that aims to place relevant documents at the top. Screening documents in the prioritized order surfaces relevant documents early, which leads to a more efficient workflow in conducting SRs. It is natural that SR experts already know of one or two relevant documents after defining the clinical question (i.e., before the screening process). 
In this thesis, we propose a new approach to screening prioritization named seed-driven document ranking. We assume that one relevant document is given, which we call the ‘seed document’, and use it as a query. The task of seed-driven document ranking is to rank candidate documents using the seed document as the query. Existing approaches to screening prioritization, by contrast, focus on building short keyword queries using statistical information in the candidate documents. In this thesis, we extensively investigate ranking models for seed-driven document ranking. We first develop a retrieval model adapted for a seed document query. We propose a ‘bag-of-clinical terms’ document representation that reduces a long document query to the essential clinical terms needed to find relevant documents. More importantly, we propose a weight function for the clinical terms in a seed document. The proposed term weight function is combined with the query likelihood retrieval model. Next, we propose a new semantic ranking model. We investigate a document matching (i.e., similarity) approach that ranks candidate documents according to their matching scores against a seed document. We design position-based semantic term matching and develop a simple two-way document matching model. The semantic document matching approach further improves upon the performance of the model proposed in the previous chapter. Thirdly, we predict the ranking performance on SRs in seed-driven document ranking. Despite state-of-the-art performance on benchmark datasets, the performance of ranking models on individual SRs varies widely. This large variation is undesirable, because during manual screening SR experts experience the local performance (performance on a single SR) rather than the global performance (performance over multiple SRs in benchmark datasets). We hypothesize that individual SRs differ in ranking difficulty depending on the breadth of their topics. To this end, we propose a measure that predicts the ranking difficulty of SRs. 
Furthermore, we explore methods to improve both the local performance, especially on difficult SRs, and the global performance of ranking models. We first study PICO elements in medical literature. Automatic PICO recognition helps identify the information that is essential for judging the relevance of documents. We examine the boundaries of PICO span annotations and discuss how to correctly and effectively evaluate machine learning models on this task. Next, we investigate embeddings of medical concepts (i.e., medical/clinical terms). Our work in previous chapters has shown that medical concepts are important information for effectively representing documents. We analyze the stability of medical concept embeddings. We conclude this thesis with a discussion of how other domain-specific applications with a similar aim of high recall, such as legal document search and test collection generation, can benefit from seed-driven document ranking.
first_indexed	2024-10-01T07:53:14Z
format	Thesis-Doctor of Philosophy
id	ntu-10356/146709
institution	Nanyang Technological University
language	English
last_indexed	2024-10-01T07:53:14Z
publishDate	2021
publisher	Nanyang Technological University
record_format	dspace
spelling	ntu-10356/1467092021-04-20T07:00:35Z Seed-driven document ranking for systematic reviews in evidence-based medicine Lee, Eunkyung Sun Aixin School of Computer Science and Engineering AXSun@ntu.edu.sg Engineering::Computer science and engineering::Information systems::Information storage and retrieval Doctor of Philosophy 2021-03-08T01:35:07Z 2021-03-08T01:35:07Z 2020 Thesis-Doctor of Philosophy Lee, F. (2020). Seed-driven document ranking for systematic reviews in evidence-based medicine. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/146709 10.32657/10356/146709 en This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). application/pdf Nanyang Technological University
spellingShingle	Engineering::Computer science and engineering::Information systems::Information storage and retrieval Lee, Eunkyung Seed-driven document ranking for systematic reviews in evidence-based medicine
title	Seed-driven document ranking for systematic reviews in evidence-based medicine
title_full	Seed-driven document ranking for systematic reviews in evidence-based medicine
title_fullStr	Seed-driven document ranking for systematic reviews in evidence-based medicine
title_full_unstemmed	Seed-driven document ranking for systematic reviews in evidence-based medicine
title_short	Seed-driven document ranking for systematic reviews in evidence-based medicine
title_sort	seed driven document ranking for systematic reviews in evidence based medicine
topic	Engineering::Computer science and engineering::Information systems::Information storage and retrieval
url	https://hdl.handle.net/10356/146709
work_keys_str_mv	AT leeeunkyung seeddrivendocumentrankingforsystematicreviewsinevidencebasedmedicine
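
Illustrative sketch (not part of the catalog record or the thesis): the description field states that a seed document is used as the query and that a term weight function is combined with the query likelihood retrieval model. The Python below shows what such a ranking could look like under stated assumptions. The function names, whitespace tokenization, Dirichlet smoothing with parameter mu, and the normalized term-frequency weight are all hypothetical stand-ins; the thesis's own clinical-term extraction and weight function are not given in this record.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokens stand in for extracted clinical terms.
    return text.lower().split()


def rank_by_seed(seed_text, candidates, mu=2000.0):
    """Rank candidate ids by weighted query likelihood, using the seed as the query.

    candidates maps a document id to its text. Dirichlet smoothing (mu) is an
    assumed smoothing choice, not something stated in the record.
    """
    seed_tf = Counter(tokenize(seed_text))
    seed_len = sum(seed_tf.values())
    doc_tfs = {doc_id: Counter(tokenize(text)) for doc_id, text in candidates.items()}

    # Collection statistics over the candidate set only.
    collection_tf = Counter()
    for tf in doc_tfs.values():
        collection_tf.update(tf)
    collection_len = sum(collection_tf.values())
    if collection_len == 0 or seed_len == 0:
        return list(candidates)  # nothing to score; keep the input order

    scores = {}
    for doc_id, tf in doc_tfs.items():
        doc_len = sum(tf.values())
        score = 0.0
        for term, count in seed_tf.items():
            p_coll = collection_tf[term] / collection_len
            if p_coll == 0.0:
                continue  # term occurs in no candidate, so it cannot discriminate
            weight = count / seed_len  # placeholder weight: normalized seed term frequency
            p_doc = (tf[term] + mu * p_coll) / (doc_len + mu)
            score += weight * math.log(p_doc)
        scores[doc_id] = score

    # Highest score first; ties keep insertion order because sorted() is stable.
    return sorted(scores, key=scores.get, reverse=True)
```

A caller would pass the known relevant study as seed_text and the Boolean search results as candidates, then screen documents in the returned order. Swapping the placeholder weight for a domain-specific one is the natural extension point.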

Similar Items