TULIP: An RNA-seq-based Primary Tumor Type Prediction Tool Using Convolutional Neural Networks

Bibliographic Details
Main Authors: Sara Jones, Matthew Beyers, Maulik Shukla, Fangfang Xia, Thomas Brettin, Rick Stevens, M Ryan Weil, Satishkumar Ranganathan Ganakammal
Format: Article
Language: English
Published: SAGE Publishing 2022-12-01
Series: Cancer Informatics
Online Access: https://doi.org/10.1177/11769351221139491
_version_ 1811185857338540032
author Sara Jones
Matthew Beyers
Maulik Shukla
Fangfang Xia
Thomas Brettin
Rick Stevens
M Ryan Weil
Satishkumar Ranganathan Ganakammal
collection DOAJ
description Background: With cancer as one of the leading causes of death worldwide, accurate primary tumor type prediction is critical in identifying genetic factors that can inhibit or slow tumor progression. In the last several years, there have been efforts to categorize primary tumor types from gene expression data using machine learning and, more recently, deep learning. Methods: In this paper, we developed four 1-dimensional (1D) Convolutional Neural Network (CNN) models to classify RNA-seq count data as one of 17 highly represented primary tumor types or one of 32 primary tumor types regardless of imbalanced representation. Additionally, we adapted the models to take as input either all Ensembl genes (60,483) or protein coding genes only (19,758). Unlike previous work, we avoided selection bias by not filtering genes based on expression values. RNA-seq count data expressed as FPKM-UQ for 9,025 and 10,940 samples from The Cancer Genome Atlas (TCGA), corresponding to 17 and 32 primary tumor types respectively, were downloaded from the Genomic Data Commons (GDC) for training and validating the models. Results: All four 1D-CNN models had an overall accuracy of 94.7% to 97.6% on the test dataset. Further evaluation indicates that the models with protein coding genes only as features were more accurate than the models with all Ensembl genes for both 17 and 32 primary tumor types. For all models, the accuracy by primary tumor type was above 80% for most primary tumor types. Conclusions: We packaged all four models as a Python-based deep learning classification tool called TULIP (TUmor CLassIfication Predictor) for performing quality control on primary tumor samples and characterizing cancer samples of unknown tumor type. Further optimization of the models is needed to improve the accuracy for certain primary tumor types.
first_indexed 2024-04-11T13:36:20Z
format Article
id doaj.art-4936251ad7744b8bb020f50e7aa44780
institution Directory Open Access Journal
issn 1176-9351
language English
last_indexed 2024-04-11T13:36:20Z
publishDate 2022-12-01
publisher SAGE Publishing
record_format Article
series Cancer Informatics
spelling doaj.art-4936251ad7744b8bb020f50e7aa44780 | 2022-12-22T04:21:27Z | eng | SAGE Publishing | Cancer Informatics | 1176-9351 | 2022-12-01 | 21 | 10.1177/11769351221139491
TULIP: An RNA-seq-based Primary Tumor Type Prediction Tool Using Convolutional Neural Networks
Sara Jones (0), Matthew Beyers (1), Maulik Shukla (2), Fangfang Xia (3), Thomas Brettin (4), Rick Stevens (5), M Ryan Weil (6), Satishkumar Ranganathan Ganakammal (7)
(0, 1, 6, 7) Frederick National Laboratory for Cancer Research, Cancer Data Science Initiatives, Cancer Research Technology Program, Rockville, MD, USA
(2, 3, 4, 5) Argonne National Laboratory, Computing, Environment and Life Sciences, Lemont, IL, USA
https://doi.org/10.1177/11769351221139491
title TULIP: An RNA-seq-based Primary Tumor Type Prediction Tool Using Convolutional Neural Networks
url https://doi.org/10.1177/11769351221139491