Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes

© 2019 by American Society of Clinical Oncology. PURPOSE: The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools that help to monitor and prioritize the literature to understand the clinical implications of pathogenic genetic variants. We developed and eva...


Bibliographic Details
Main Authors: Bao, Yujia; Deng, Zhengyi; Wang, Yan; Kim, Heeyoon; Armengol, Victor Diego; Acevedo, Francisco; Ouardaoui, Nofal; Wang, Cathy; Parmigiani, Giovanni; Barzilay, Regina; Braun, Danielle; Hughes, Kevin S
Format: Article
Language: English
Published: American Society of Clinical Oncology (ASCO) 2021
Online Access: https://hdl.handle.net/1721.1/136447
_version_ 1826212555917361152
author Bao, Yujia
Deng, Zhengyi
Wang, Yan
Kim, Heeyoon
Armengol, Victor Diego
Acevedo, Francisco
Ouardaoui, Nofal
Wang, Cathy
Parmigiani, Giovanni
Barzilay, Regina
Braun, Danielle
Hughes, Kevin S
author_sort Bao, Yujia
collection MIT
description © 2019 by American Society of Clinical Oncology.
PURPOSE: The medical literature relevant to germline genetics is growing exponentially. Clinicians need tools that help to monitor and prioritize the literature to understand the clinical implications of pathogenic genetic variants. We developed and evaluated two machine learning models to classify abstracts as relevant to the penetrance (risk of cancer for germline mutation carriers) or prevalence of germline genetic mutations.
MATERIALS AND METHODS: We conducted literature searches in PubMed and retrieved paper titles and abstracts to create an annotated data set for training and evaluating the two machine learning classification models. Our first model is a support vector machine (SVM) that learns a linear decision rule on the basis of the bag-of-ngrams representation of each title and abstract. Our second model is a convolutional neural network (CNN) that learns a complex nonlinear decision rule on the basis of the raw title and abstract. We evaluated the performance of the two models on the classification of papers as relevant to penetrance or prevalence.
RESULTS: For penetrance classification, we annotated 3,740 paper titles and abstracts and evaluated the two models using 10-fold cross-validation. The SVM model achieved 88.93% accuracy (the percentage of papers that were correctly classified), whereas the CNN model achieved 88.53% accuracy. For prevalence classification, we annotated 3,753 paper titles and abstracts. The SVM model achieved 88.92% accuracy and the CNN model achieved 88.52% accuracy.
CONCLUSION: Our models achieve high accuracy in classifying abstracts as relevant to penetrance or prevalence. By facilitating literature review, this tool could help clinicians and researchers keep abreast of the burgeoning knowledge of gene–cancer associations and keep the knowledge bases for clinical decision support tools up to date.
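The abstract names three ingredients: a bag-of-ngrams representation, a linear SVM decision rule, and k-fold cross-validation accuracy. The following is a minimal standard-library sketch of how those pieces can fit together; it is not the authors' implementation. The function names, the word-level tokenization, the n-gram length, the plain stochastic subgradient trainer (with no bias term), and all hyperparameters are assumptions made for illustration only.

```python
import random
from collections import Counter


def ngrams(text, n_max=2):
    """Bag of word n-grams (lengths 1..n_max) for one title+abstract string."""
    tokens = text.lower().split()
    bag = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag


def score(weights, bag):
    """Linear decision value w . x for a sparse bag of n-gram counts."""
    return sum(weights.get(term, 0.0) * count for term, count in bag.items())


def train_linear_svm(bags, labels, epochs=50, lr=0.1, lam=0.01, seed=0):
    """Stochastic subgradient descent on the L2-regularised hinge loss.

    Labels must be +1 or -1. Hyperparameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    weights = {}
    order = list(range(len(bags)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = labels[i] * score(weights, bags[i])
            for term in weights:  # shrink step from the L2 penalty
                weights[term] *= 1.0 - lr * lam
            if margin < 1.0:  # hinge loss is active: move towards the example
                for term, count in bags[i].items():
                    weights[term] = weights.get(term, 0.0) + lr * labels[i] * count
    return weights


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(texts, labels, k=10):
    """Fraction of held-out items classified correctly over k folds (k >= 2)."""
    bags = [ngrams(t) for t in texts]
    correct = 0
    for fold in k_fold_indices(len(texts), k):
        held_out = set(fold)
        train = [i for i in range(len(texts)) if i not in held_out]
        weights = train_linear_svm([bags[i] for i in train],
                                   [labels[i] for i in train])
        for i in fold:
            predicted = 1 if score(weights, bags[i]) >= 0 else -1
            if predicted == labels[i]:
                correct += 1
    return correct / len(texts)


if __name__ == "__main__":
    # Hypothetical toy examples, not drawn from the annotated data set.
    toy_texts = [
        "cancer risk in mutation carriers",
        "protein folding simulation study",
        "lifetime risk of cancer for carriers",
        "simulation of protein dynamics",
    ]
    toy_labels = [1, -1, 1, -1]
    print(cross_val_accuracy(toy_texts, toy_labels, k=2))
```

Because the folds here are contiguous, real data would normally be shuffled (or stratified by label) before splitting; that step is left out to keep the sketch short.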
first_indexed 2024-09-23T15:25:47Z
format Article
id mit-1721.1/136447
institution Massachusetts Institute of Technology
language English
last_indexed 2024-09-23T15:25:47Z
publishDate 2021
publisher American Society of Clinical Oncology (ASCO)
record_format dspace
spelling mit-1721.1/136447 2021-10-28T03:22:36Z 2021-10-27T20:35:25Z 2021-10-27T20:35:25Z 2019 2020-12-01T16:55:48Z Article http://purl.org/eprint/type/JournalArticle https://hdl.handle.net/1721.1/136447 en 10.1200/CCI.19.00042 JCO Clinical Cancer Informatics Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf American Society of Clinical Oncology (ASCO) arXiv
title Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes
url https://hdl.handle.net/1721.1/136447