A rank-based marker selection method for high throughput scRNA-seq data

Abstract Background High throughput microfluidic protocols in single cell RNA sequencing (scRNA-seq) collect mRNA counts from up to one million individual cells in a single experiment; this enables high resolution studies of rare cell types and cell development pathways. Determining small sets of ge...

Bibliographic Details
Main Authors:	Alexander H. S. Vargo, Anna C. Gilbert
Format:	Article
Language:	English
Published:	BMC 2020-10-01
Series:	BMC Bioinformatics
Subjects:	Single cell RNA-seq; Marker selection; Machine learning; Data analysis; Algorithms; Benchmarking
Online Access:	http://link.springer.com/article/10.1186/s12859-020-03641-z

_version_	1818655294823071744
author	Alexander H. S. Vargo, Anna C. Gilbert
author_facet	Alexander H. S. Vargo, Anna C. Gilbert
author_sort	Alexander H. S. Vargo
collection	DOAJ
description	Abstract Background High throughput microfluidic protocols in single cell RNA sequencing (scRNA-seq) collect mRNA counts from up to one million individual cells in a single experiment; this enables high resolution studies of rare cell types and cell development pathways. Determining small sets of genetic markers that can identify specific cell populations is thus one of the major objectives of computational analysis of mRNA counts data. Many tools have been developed for marker selection on single cell data; most of them, however, are based on complex statistical models and handle the multi-class case in an ad-hoc manner. Results We introduce RankCorr, a fast method with strong mathematical underpinnings that performs multi-class marker selection in an informed manner. RankCorr proceeds by ranking the mRNA counts data before linearly separating the ranked data using a small number of genes. The step of ranking is intuitively natural for scRNA-seq data and provides a non-parametric method for analyzing count data. In addition, we present several performance measures for evaluating the quality of a set of markers when there is no known ground truth. Using these metrics, we compare the performance of RankCorr to a variety of other marker selection methods on an assortment of experimental and synthetic data sets that range in size from several thousand to one million cells. Conclusions According to the metrics introduced in this work, RankCorr is consistently one of the most optimal marker selection methods on scRNA-seq data. Most methods show similar overall performance, however; thus, the speed of the algorithm is the most important consideration for large data sets (and comparing the markers selected by several methods can be fruitful). RankCorr is fast enough to easily handle the largest data sets and, as such, it is a useful tool to add into computational pipelines when dealing with high throughput scRNA-seq data. RankCorr software is available for download at https://github.com/ahsv/RankCorr with extensive documentation.
first_indexed	2024-12-17T03:07:25Z
format	Article
id	doaj.art-2b90c0814b264a37b75aa3866d26fe7b
institution	Directory of Open Access Journals
issn	1471-2105
language	English
last_indexed	2024-12-17T03:07:25Z
publishDate	2020-10-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj.art-2b90c0814b264a37b75aa3866d26fe7b 2022-12-21T22:05:55Z eng BMC BMC Bioinformatics 1471-2105 2020-10-01 211151 10.1186/s12859-020-03641-z A rank-based marker selection method for high throughput scRNA-seq data Alexander H. S. Vargo (Department of Mathematics, University of Michigan) Anna C. Gilbert (Department of Mathematics, Yale University) http://link.springer.com/article/10.1186/s12859-020-03641-z Single cell RNA-seq; Marker selection; Machine learning; Data analysis; Algorithms; Benchmarking
title	A rank-based marker selection method for high throughput scRNA-seq data
topic	Single cell RNA-seq; Marker selection; Machine learning; Data analysis; Algorithms; Benchmarking
url	http://link.springer.com/article/10.1186/s12859-020-03641-z
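
The abstract in this record describes RankCorr as ranking the mRNA counts and then separating the ranked data using a small number of genes. The sketch below only illustrates that general idea and is not the authors' algorithm (their implementation is the software linked in the abstract). It assumes a simplified one-vs-rest scoring in which each gene's counts are replaced by average ranks across cells and genes are ordered by the absolute correlation between those ranks and a cluster indicator. The function names `average_ranks`, `pearson`, and `select_markers`, and the toy data, are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks of `values`; tied entries share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all entries tied with the current value.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based positions in the tie
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def select_markers(counts, labels, target, n_markers):
    """Pick the genes whose ranked counts best track membership in `target`.

    counts: list of cells, each a list of per-gene counts (assumed layout).
    labels: one cluster label per cell.
    """
    indicator = [1.0 if lab == target else 0.0 for lab in labels]
    n_genes = len(counts[0])
    scores = []
    for g in range(n_genes):
        ranked = average_ranks([cell[g] for cell in counts])
        scores.append(pearson(ranked, indicator))
    # Sort by absolute score so down-regulated genes can also be selected.
    by_score = sorted(range(n_genes), key=lambda g: abs(scores[g]), reverse=True)
    return by_score[:n_markers]


# Toy data: 4 cells x 3 genes, two clusters (illustrative values only).
counts = [
    [5, 0, 2],
    [7, 3, 2],
    [0, 1, 2],
    [1, 4, 2],
]
labels = ["A", "A", "B", "B"]
markers_for_a = select_markers(counts, labels, "A", 2)
```

Working on ranks rather than raw counts makes the score insensitive to the scale of the counts, which matches the non-parametric motivation given in the abstract. A constant gene (the third column above) receives a score of zero and is never preferred.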

Similar Items