MScanner: a classifier for retrieving Medline citations


Bibliographic Details
Main Authors: Altman Russ B, Rubin Daniel L, Poulter Graham L, Seoighe Cathal
Format: Article
Language: English
Published: BMC 2008-02-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/9/108
collection DOAJ
description Abstract

Background: Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains.

Results: MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross-validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross-validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic. In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92.

Conclusion: MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available online and as supplementary material, and the web interface may be accessed at http://mscanner.stanford.edu.
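The abstract states only that MScanner is a Bayesian classifier over MeSH terms and journal of publication, and that it is evaluated by ROC area. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes a naive Bayes model with binary features, add-one smoothing, and scoring by present features only; all function names and the `mesh:`/`journal:` feature prefixes are hypothetical.

```python
import math
from collections import Counter


def train(relevant_docs, background_docs):
    """Return a log-likelihood-ratio weight for each feature.

    Each document is a set of feature strings, for example MeSH terms
    plus a journal tag (the string format is an assumption).
    """
    n_rel, n_bg = len(relevant_docs), len(background_docs)
    rel_counts = Counter(f for doc in relevant_docs for f in doc)
    bg_counts = Counter(f for doc in background_docs for f in doc)
    weights = {}
    for feature in set(rel_counts) | set(bg_counts):
        # Add-one smoothing (an assumption) keeps both estimates non-zero.
        p_rel = (rel_counts[feature] + 1) / (n_rel + 2)
        p_bg = (bg_counts[feature] + 1) / (n_bg + 2)
        weights[feature] = math.log(p_rel / p_bg)
    return weights


def score(doc, weights):
    """Sum the weights of features present in the document.

    Simplification: absent features and unseen features contribute nothing.
    """
    return sum(weights.get(f, 0.0) for f in doc)


def rank(candidates, weights):
    """Sort (doc_id, features) pairs by decreasing score."""
    return sorted(candidates, key=lambda item: score(item[1], weights), reverse=True)


def roc_area(relevant_scores, irrelevant_scores):
    """ROC area as the probability that a relevant document outscores
    an irrelevant one, counting ties as one half."""
    wins = 0.0
    for r in relevant_scores:
        for s in irrelevant_scores:
            if r > s:
                wins += 1.0
            elif r == s:
                wins += 0.5
    return wins / (len(relevant_scores) * len(irrelevant_scores))
```

In this sketch, training examples would be the feature sets of the submitted PubMed IDs, the background would be a sample of other Medline records, and `rank` would order unseen records for review. The pairwise `roc_area` loop is quadratic and is meant only to make the definition concrete.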
first_indexed 2024-12-17T10:19:51Z
format Article
id doaj.art-c5e60c39b7944e3c882219dd0e3b46dc
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-12-17T10:19:51Z
publishDate 2008-02-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-c5e60c39b7944e3c882219dd0e3b46dc; 2022-12-21T21:52:50Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2008-02-01; 9; 1; 108; 10.1186/1471-2105-9-108