Application of unsupervised analysis techniques to lung cancer patient data.

This study applies unsupervised machine learning techniques for classification and clustering to a collection of descriptive variables from 10,442 lung cancer patient records in the Surveillance, Epidemiology, and End Results (SEER) program database. The goal is to automatically classify lung cancer...

Bibliographic Details
Main Authors:	Chip M Lynch, Victor H van Berkel, Hermann B Frieboes
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2017-01-01
Series:	PLoS ONE
Online Access:	https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0184370&type=printable

author	Chip M Lynch Victor H van Berkel Hermann B Frieboes
collection	DOAJ
description	This study applies unsupervised machine learning techniques for classification and clustering to a collection of descriptive variables from 10,442 lung cancer patient records in the Surveillance, Epidemiology, and End Results (SEER) program database. The goal is to automatically classify lung cancer patients into groups based on clinically measurable disease-specific variables in order to estimate survival. Variables selected as inputs for machine learning include Number of Primaries, Age, Grade, Tumor Size, Stage, and TNM, which are numeric or can readily be converted to numeric type. Minimal up-front processing of the data enables exploring the out-of-the-box capabilities of established unsupervised learning techniques, with little human intervention through the entire process. The output of the techniques is used to predict survival time, with the efficacy of the prediction representing a proxy for the usefulness of the classification. A basic single variable linear regression against each unsupervised output is applied, and the associated Root Mean Squared Error (RMSE) value is calculated as a metric to compare between the outputs. The results show that self-ordering maps exhibit the best performance, while k-Means performs the best of the simpler classification techniques. Predicting against the full data set, it is found that their respective RMSE values (15.591 for self-ordering maps and 16.193 for k-Means) are comparable to supervised regression techniques, such as Gradient Boosting Machine (RMSE of 15.048). We conclude that unsupervised data analysis techniques may be of use to classify patients by defining the classes as effective proxies for survival prediction.
first_indexed	2024-12-11T11:42:14Z
format	Article
id	doaj.art-32c64bd59f0e4d0497504af4330fa982
institution	Directory Open Access Journal
issn	1932-6203
language	English
last_indexed	2025-03-14T14:12:12Z
publishDate	2017-01-01
publisher	Public Library of Science (PLoS)
record_format	Article
series	PLoS ONE
spelling	doaj.art-32c64bd59f0e4d0497504af4330fa982 | 2025-02-27T05:36:00Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2017-01-01 | 12 | 9 | e0184370 | 10.1371/journal.pone.0184370 | Application of unsupervised analysis techniques to lung cancer patient data. | Chip M Lynch | Victor H van Berkel | Hermann B Frieboes | https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0184370&type=printable
title	Application of unsupervised analysis techniques to lung cancer patient data.
url	https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0184370&type=printable
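
The evaluation step named in the description field (a single-variable linear regression of survival time against each unsupervised output, scored by RMSE) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy cluster labels, the survival values, and the choice of months as the unit are all assumptions made up for illustration, and the labels could come from any clustering technique.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Assumes x contains at least two distinct values; otherwise the
    denominator sxx is zero and the slope is undefined.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(y_true, y_pred):
    """Root Mean Squared Error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy data, NOT taken from SEER or from the article:
# one cluster label per patient and a made-up survival time in months.
cluster_labels = [0, 0, 1, 1, 2, 2]
survival_months = [10, 12, 20, 22, 30, 32]

# Regress survival on the cluster label alone, then score the fit.
slope, intercept = fit_line(cluster_labels, survival_months)
predicted = [intercept + slope * label for label in cluster_labels]
error = rmse(survival_months, predicted)
print(f"slope={slope:.3f} intercept={intercept:.3f} RMSE={error:.3f}")
```

A lower RMSE here means the cluster label alone tracks survival more closely, which is the sense in which the description treats prediction error as a proxy for the usefulness of a classification.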

Similar Items