SlimPLS: a method for feature selection in gene expression-based disease classification.

A major challenge in biomedical studies in recent years has been the classification of gene expression profiles into categories, such as cases and controls. This is done by first training a classifier by using a labeled training set containing labeled samples from the two populations, and then using...

Bibliographic Details
Main Authors: Michael Gutkin, Ron Shamir, Gideon Dror
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2009-07-01
Series: PLoS ONE
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/19649265/pdf/?tool=EBI
_version_ 1818682027064426496
author Michael Gutkin
Ron Shamir
Gideon Dror
collection DOAJ
description A major challenge in biomedical studies in recent years has been the classification of gene expression profiles into categories, such as cases and controls. This is done by first training a classifier by using a labeled training set containing labeled samples from the two populations, and then using that classifier to predict the labels of new samples. Such predictions have recently been shown to improve the diagnosis and treatment selection practices for several diseases. This procedure is complicated, however, by the high dimensionality of the data. While microarrays can measure the levels of thousands of genes per sample, case-control microarray studies usually involve no more than several dozen samples. Standard classifiers do not work well in these situations where the number of features (gene expression levels measured in these microarrays) far exceeds the number of samples. Selecting only the features that are most relevant for discriminating between the two categories can help construct better classifiers, in terms of both accuracy and efficiency. In this work we developed a novel method for multivariate feature selection based on the Partial Least Squares algorithm. We compared the method's variants with common feature selection techniques across a large number of real case-control datasets, using several classifiers. We demonstrate the advantages of the method and the preferable combinations of classifier and feature selection technique.
first_indexed 2024-12-17T10:12:18Z
format Article
id doaj.art-f60a986aeaf14bd5a72e9be2b371e2a2
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-12-17T10:12:18Z
publishDate 2009-07-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-f60a986aeaf14bd5a72e9be2b371e2a2; 2022-12-21T21:53:00Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2009-07-01; 4; 7; e6416; 10.1371/journal.pone.0006416; SlimPLS: a method for feature selection in gene expression-based disease classification.; Michael Gutkin; Ron Shamir; Gideon Dror; A major challenge in biomedical studies in recent years has been the classification of gene expression profiles into categories, such as cases and controls. This is done by first training a classifier by using a labeled training set containing labeled samples from the two populations, and then using that classifier to predict the labels of new samples. Such predictions have recently been shown to improve the diagnosis and treatment selection practices for several diseases. This procedure is complicated, however, by the high dimensionality of the data. While microarrays can measure the levels of thousands of genes per sample, case-control microarray studies usually involve no more than several dozen samples. Standard classifiers do not work well in these situations where the number of features (gene expression levels measured in these microarrays) far exceeds the number of samples. Selecting only the features that are most relevant for discriminating between the two categories can help construct better classifiers, in terms of both accuracy and efficiency. In this work we developed a novel method for multivariate feature selection based on the Partial Least Squares algorithm. We compared the method's variants with common feature selection techniques across a large number of real case-control datasets, using several classifiers. We demonstrate the advantages of the method and the preferable combinations of classifier and feature selection technique.; https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/19649265/pdf/?tool=EBI
spellingShingle Michael Gutkin
Ron Shamir
Gideon Dror
SlimPLS: a method for feature selection in gene expression-based disease classification.
PLoS ONE
title SlimPLS: a method for feature selection in gene expression-based disease classification.
title_sort slimpls a method for feature selection in gene expression based disease classification
url https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/19649265/pdf/?tool=EBI