Robust identification of molecular phenotypes using semi-supervised learning

Abstract. Background: Modern molecular profiling techniques are yielding vast amounts of data from patient samples that could be utilized with machine learning methods to provide important biological insights and improvements in patient outcomes. Unsupervised methods have been successfully used to ide...


Bibliographic Details
Main Authors:	Heinrich Roder, Carlos Oliveira, Lelia Net, Benjamin Linstid, Maxim Tsypin, Joanna Roder
Format:	Article
Language:	English
Published:	BMC 2019-05-01
Series:	BMC Bioinformatics
Subjects:	Machine learning; Clustering; Molecular phenotype; Semi-supervised learning
Online Access:	http://link.springer.com/article/10.1186/s12859-019-2885-3

author	Heinrich Roder, Carlos Oliveira, Lelia Net, Benjamin Linstid, Maxim Tsypin, Joanna Roder
collection	DOAJ
description	Abstract
Background: Modern molecular profiling techniques are yielding vast amounts of data from patient samples that could be utilized with machine learning methods to provide important biological insights and improvements in patient outcomes. Unsupervised methods have been successfully used to identify molecularly-defined disease subtypes. However, these approaches do not take advantage of potential additional clinical outcome information. Supervised methods can be implemented when training classes are apparent (e.g., responders or non-responders to treatment). However, training classes can be difficult to define when assessing relative benefit of one therapy over another using gold standard clinical endpoints, since it is often not clear how much benefit each individual patient receives.
Results: We introduce an iterative approach to binary classification tasks based on the simultaneous refinement of training class labels and classifiers towards self-consistency. As training labels are refined during the process, the method is well suited to cases where training class definitions are not obvious or noisy. Clinical data, including time-to-event endpoints, can be incorporated into the approach to enable the iterative refinement to identify molecular phenotypes associated with a particular clinical variable. Using synthetic data, we show how this approach can be used to increase the accuracy of identification of outcome-related phenotypes and their associated molecular attributes. Further, we demonstrate that the advantages of the method persist in real world genomic datasets, allowing the reliable identification of molecular phenotypes and estimation of their association with outcome that generalizes to validation datasets. We show that at convergence of the iterative refinement, there is a consistent incorporation of the molecular data into the classifier yielding the molecular phenotype and that this allows a robust identification of associated attributes and the underlying biological processes.
Conclusions: The consistent incorporation of the structure of the molecular data into the classifier helps to minimize overfitting and facilitates not only good generalization of classification and molecular phenotypes, but also reliable identification of biologically relevant features and elucidation of underlying biological processes.
first_indexed	2024-12-14T23:20:18Z
format	Article
id	doaj.art-212178dfa247411b8864f81c70ed1d7c
institution	Directory Open Access Journal
issn	1471-2105
language	English
last_indexed	2024-12-14T23:20:18Z
publishDate	2019-05-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj.art-212178dfa247411b8864f81c70ed1d7c; 2022-12-21T22:43:59Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2019-05-01; 201125; 10.1186/s12859-019-2885-3; Robust identification of molecular phenotypes using semi-supervised learning; Heinrich Roder (Biodesix Inc), Carlos Oliveira (Biodesix Inc), Lelia Net (Biodesix Inc), Benjamin Linstid (Biodesix Inc), Maxim Tsypin (Biodesix Inc), Joanna Roder (Biodesix Inc); http://link.springer.com/article/10.1186/s12859-019-2885-3; Machine learning; Clustering; Molecular phenotype; Semi-supervised learning
title	Robust identification of molecular phenotypes using semi-supervised learning
topic	Machine learning; Clustering; Molecular phenotype; Semi-supervised learning
url	http://link.springer.com/article/10.1186/s12859-019-2885-3


Similar Items