Application of unsupervised analysis techniques to lung cancer patient data.

This study applies unsupervised machine learning techniques for classification and clustering to a collection of descriptive variables from 10,442 lung cancer patient records in the Surveillance, Epidemiology, and End Results (SEER) program database. The goal is to automatically classify lung cancer...


Bibliographic Details
Main Authors: Chip M Lynch, Victor H van Berkel, Hermann B Frieboes
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2017-01-01
Series: PLoS ONE
Online Access: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0184370&type=printable
collection DOAJ
description This study applies unsupervised machine learning techniques for classification and clustering to a collection of descriptive variables from 10,442 lung cancer patient records in the Surveillance, Epidemiology, and End Results (SEER) program database. The goal is to automatically classify lung cancer patients into groups based on clinically measurable disease-specific variables in order to estimate survival. Variables selected as inputs for machine learning include Number of Primaries, Age, Grade, Tumor Size, Stage, and TNM, which are numeric or can readily be converted to numeric type. Minimal up-front processing of the data enables exploring the out-of-the-box capabilities of established unsupervised learning techniques, with little human intervention through the entire process. The output of the techniques is used to predict survival time, with the efficacy of the prediction representing a proxy for the usefulness of the classification. A basic single variable linear regression against each unsupervised output is applied, and the associated Root Mean Squared Error (RMSE) value is calculated as a metric to compare between the outputs. The results show that self-ordering maps exhibit the best performance, while k-Means performs the best of the simpler classification techniques. Predicting against the full data set, it is found that their respective RMSE values (15.591 for self-ordering maps and 16.193 for k-Means) are comparable to supervised regression techniques, such as Gradient Boosting Machine (RMSE of 15.048). We conclude that unsupervised data analysis techniques may be of use to classify patients by defining the classes as effective proxies for survival prediction.
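The evaluation idea in the description (cluster the records without using survival, then fit a single-variable linear regression of survival on the cluster output and report RMSE) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the data are synthetic rather than SEER records, only two made-up inputs (age and tumor size) are used, the helper names `kmeans`, `fit_line`, and `rmse` are hypothetical, and treating the cluster label as the numeric regressor is one possible reading of "regression against each unsupervised output".

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        # Move each center to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Synthetic stand-in data (assumption): (age in years, tumor size in mm).
rng = random.Random(42)
records = [(rng.uniform(40, 90), rng.uniform(5, 100)) for _ in range(200)]
# Synthetic survival in months, loosely tied to the inputs plus noise.
survival = [max(0.0, 80 - 0.4 * age - 0.3 * size + rng.gauss(0, 5))
            for age, size in records]

# Unsupervised step: survival is never shown to the clustering.
labels = kmeans(records, k=4)

# Single-variable regression of survival on the cluster label, then RMSE.
slope, intercept = fit_line(labels, survival)
predicted = [intercept + slope * lab for lab in labels]
cluster_rmse = rmse(survival, predicted)

# Reference point: always predicting the mean survival.
mean_survival = sum(survival) / len(survival)
baseline_rmse = rmse(survival, [mean_survival] * len(survival))

print(f"RMSE using cluster label: {cluster_rmse:.3f}")
print(f"RMSE using mean only:     {baseline_rmse:.3f}")
```

A lower RMSE for the cluster-based fit than for the mean-only baseline suggests the clusters carry some information about survival, which is the sense in which prediction error serves as a proxy for the usefulness of the grouping. Because cluster labels are arbitrary integers, a straight-line fit on them is a crude measure; it is used here only to mirror the simple setup the description mentions.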
first_indexed 2024-12-11T11:42:14Z
id doaj.art-32c64bd59f0e4d0497504af4330fa982
institution Directory Open Access Journal
issn 1932-6203
last_indexed 2025-03-14T14:12:12Z
spelling doaj.art-32c64bd59f0e4d0497504af4330fa982 2025-02-27T05:36:00Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2017-01-01 12 9 e0184370 10.1371/journal.pone.0184370