SRIQ clustering: A fusion of Random Forest, QT clustering, and KNN concepts

Bibliographic Details
Main Authors: Jacob Karlström, Mattias Aine, Johan Staaf, Srinivas Veerla
Format: Article
Language: English
Published: Elsevier 2022-01-01
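Series: Computational and Structural Biotechnology Journal
Subjects: Lung adenocarcinoma; Clustering; Molecular subtypes; Gene expression; Random Forest; QT clustering
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037022001118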
author Jacob Karlström
Mattias Aine
Johan Staaf
Srinivas Veerla
collection DOAJ
description Gene expression profiling together with unsupervised analysis methods, typically clustering methods, has been used extensively in cancer research to unravel, e.g., new molecular subtypes that hold promise of disease refinement that may ultimately benefit patients. However, many of the commonly used methods require a prespecified number of clusters to extract and frequently require some type of feature pre-selection, e.g., variance filtering. This introduces subjectivity to the process of cluster discovery and the definition of putative novel tumor subtypes. Here, we introduce SRIQ, a novel unsupervised clustering method that could circumvent some of the issues in commonly used unsupervised analysis methods. SRIQ incorporates concepts from random forest machine learning as well as quality threshold (QT) and k-nearest neighbor (KNN) clustering. It is implemented as a Java and Python pipeline including data pre-processing, differential expression analysis, and pathway analysis. Using 434 lung adenocarcinomas profiled by RNA sequencing, we demonstrate the technical reproducibility of SRIQ and benchmark its performance against the commonly used consensus clustering method. Based on differential gene expression analysis and auxiliary molecular data, we show that SRIQ can define new tumor subsets that appear biologically relevant and consistent, and that these new subgroups seem to refine existing transcriptional subtypes that were defined using consensus clustering. Together, this provides support that SRIQ may be a useful new tool for unsupervised analysis of gene expression data from human malignancies.
first_indexed 2024-04-11T05:20:07Z
format Article
id doaj.art-f72b08a56ae947ceb3458ab37b76c041
institution Directory Open Access Journal
issn 2001-0370
language English
last_indexed 2024-04-11T05:20:07Z
publishDate 2022-01-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
spelling doaj.art-f72b08a56ae947ceb3458ab37b76c041 2022-12-24T04:51:55Z eng
Elsevier, Computational and Structural Biotechnology Journal, ISSN 2001-0370, 2022-01-01, vol. 20, pp. 1567-1579
SRIQ clustering: A fusion of Random Forest, QT clustering, and KNN concepts
Jacob Karlström (0), Mattias Aine (1), Johan Staaf (2, corresponding author), Srinivas Veerla (3, corresponding author)
Affiliation (0-3): Division of Oncology, Department of Clinical Sciences Lund, Lund University, Medicon Village, SE-22381 Lund, Sweden
http://www.sciencedirect.com/science/article/pii/S2001037022001118
Lung adenocarcinoma; Clustering; Molecular subtypes; Gene expression; Random Forest; QT clustering
title SRIQ clustering: A fusion of Random Forest, QT clustering, and KNN concepts
topic Lung adenocarcinoma
Clustering
Molecular subtypes
Gene expression
Random Forest
QT clustering
url http://www.sciencedirect.com/science/article/pii/S2001037022001118
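
The description names quality threshold (QT) clustering as one of the concepts SRIQ draws on, and notes that many common methods need the number of clusters fixed in advance. As a rough illustration of why QT-style clustering avoids that requirement, the sketch below implements the generic textbook QT procedure in plain Python. It is not the SRIQ implementation and is not taken from the article: the function name `qt_cluster`, the parameters `max_diameter` and `min_size`, the use of Euclidean distance, and the toy data are all assumptions made for this example.

```python
import math


def qt_cluster(points, max_diameter, min_size=1):
    """Generic quality-threshold clustering (illustrative; not SRIQ itself).

    Returns (clusters, unassigned), both holding indices into `points`.
    The number of clusters is not specified up front; it follows from
    the diameter threshold.
    """
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        best = []
        # Grow one candidate cluster from every remaining point.
        for seed in remaining:
            candidate = [seed]
            pool = [i for i in remaining if i != seed]
            while pool:
                # Adding point i makes the diameter at least the distance
                # from i to its farthest member of the candidate.
                def farthest_member(i):
                    return max(math.dist(points[i], points[j]) for j in candidate)

                nxt = min(pool, key=farthest_member)
                if farthest_member(nxt) > max_diameter:
                    break  # any further point would exceed the threshold
                candidate.append(nxt)
                pool.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        if len(best) < min_size:
            break  # leftover points are reported as unassigned
        clusters.append(sorted(best))
        remaining = [i for i in remaining if i not in best]
    return clusters, remaining


# Hypothetical toy data: two tight groups and one isolated point.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
toy_clusters, toy_unassigned = qt_cluster(toy_points, max_diameter=2.0, min_size=2)
```

On the toy data the two tight groups are recovered and the isolated point is left unassigned, with no cluster count supplied by the caller. The subjective choice moves from the number of clusters to the diameter threshold, and how SRIQ itself sets or combines such thresholds is described in the article, not here.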