Gene expression based survival prediction for cancer patients-A topic modeling approach.

Cancer is one of the leading causes of death worldwide. Many believe that genomic data will enable us to better predict the survival time of these patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping...


Bibliographic Details
Main Authors: Luke Kumar, Russell Greiner
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2019-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0224446
_version_ 1828957265217978368
author Luke Kumar
Russell Greiner
collection DOAJ
description Cancer is one of the leading causes of death worldwide. Many believe that genomic data will enable us to better predict the survival time of these patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping with the high dimensionality of such gene expression data, many projects use dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from the high-dimensional gene expression data. There, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over the words; here, to accommodate the heterogeneity of a patient's cancer, we represent each patient (≈ document) as a mixture over cancer-topics, where each cancer-topic is a mixture over gene expression values (≈ words). This required some extensions to the standard LDA model (e.g., to accommodate the real-valued expression values), leading to our novel discretized Latent Dirichlet Allocation (dLDA) procedure. After using this dLDA to learn these cancer-topics, we can express each patient as a distribution over a small number of cancer-topics, then use this low-dimensional "distribution vector" as input to a learning algorithm; here, we ran the recent survival prediction algorithm, MTLR, on this representation of the cancer dataset. We initially focus on the METABRIC dataset, which describes each of n = 1,981 breast cancer patients using r = 49,576 gene expression values from microarrays. Our results show that our approach (dLDA followed by MTLR) provides survival estimates that are more accurate than standard models, in terms of the standard Concordance measure. We then validate this "dLDA+MTLR" approach by running it on the n = 883 Pan-kidney (KIPAN) dataset, over r = 15,529 gene expression values (here using the mRNAseq modality), and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent "D-calibrated" measure. These successes, in two different cancer types and expression modalities, demonstrate the generality and the effectiveness of this approach. The dLDA+MTLR source code is available at https://github.com/nitsanluke/GE-LDA-Survival.
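Two ideas in the description can be illustrated with a minimal Python sketch: turning real-valued expression values into discrete "word counts" that a topic model could consume, and scoring predictions with a Concordance measure. Both functions are illustrative assumptions, not the authors' code. The names `discretize_to_counts` and `concordance_index` are hypothetical, the equal-width binning is a stand-in for whatever encoding dLDA actually uses, and the concordance function follows the common pairwise definition with right-censoring, which may differ in detail from the variant used in the article.

```python
from itertools import combinations


def discretize_to_counts(values, n_levels=4):
    """Map one patient's real-valued expression values to integer levels.

    Assumption: equal-width binning over this patient's own value range;
    the encoding used by the article's dLDA may differ.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_levels
    # The maximum value would fall into bin n_levels, so clamp to the top bin.
    return [min(int((v - lo) / width), n_levels - 1) for v in values]


def concordance_index(times, events, predicted):
    """Fraction of comparable patient pairs ranked in the observed order.

    times:     observed follow-up time per patient
    events:    1 if death was observed, 0 if the patient was censored
    predicted: predicted survival time per patient (larger = longer survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied observed times are skipped for simplicity
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier time is censored, so the true order is unknown
        comparable += 1
        if predicted[first] < predicted[second]:
            concordant += 1.0
        elif predicted[first] == predicted[second]:
            concordant += 0.5  # tied predictions count as half
    return concordant / comparable if comparable else float("nan")
```

In a full pipeline, the discretized counts for all patients would go to a topic model, and each patient's resulting topic distribution would be the input to the survival learner; the concordance function would then compare that learner's predictions with the observed, possibly censored, survival times.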
first_indexed 2024-12-14T08:25:19Z
format Article
id doaj.art-d7fb2d2fa0f044e79e45b19aa8660644
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-12-14T08:25:19Z
publishDate 2019-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
title Gene expression based survival prediction for cancer patients-A topic modeling approach.
url https://doi.org/10.1371/journal.pone.0224446