Comparison and evaluation of statistical error models for scRNA-seq

Abstract. Background: Heterogeneity in single-cell RNA-seq (scRNA-seq) data is driven by multiple sources, including biological variation in cellular state as well as technical variation introduced during experimental processing. Deconvolving these effects is a key challenge for preprocessing workflow...

Bibliographic Details
Main Authors:	Saket Choudhary, Rahul Satija
Format:	Article
Language:	English
Published:	BMC 2022-01-01
Series:	Genome Biology
Subjects:	Single-cell RNA-seq, Normalization, Dimension reduction, Variable genes, Differential expression, Feature selection
Online Access:	https://doi.org/10.1186/s13059-021-02584-9

author	Saket Choudhary, Rahul Satija
collection	DOAJ
description	Abstract. Background: Heterogeneity in single-cell RNA-seq (scRNA-seq) data is driven by multiple sources, including biological variation in cellular state as well as technical variation introduced during experimental processing. Deconvolving these effects is a key challenge for preprocessing workflows. Recent work has demonstrated the importance and utility of count models for scRNA-seq analysis, but there is a lack of consensus on which statistical distributions and parameter settings are appropriate. Results: Here, we analyze 59 scRNA-seq datasets that span a wide range of technologies, systems, and sequencing depths in order to evaluate the performance of different error models. We find that while a Poisson error model appears appropriate for sparse datasets, there is clear evidence of overdispersion for genes with sufficient sequencing depth in all biological systems, necessitating the use of a negative binomial model. Moreover, we find that the degree of overdispersion varies widely across datasets, systems, and gene abundances, which argues for a data-driven approach to parameter estimation. Conclusions: Based on these analyses, we provide a set of recommendations for modeling variation in scRNA-seq data, particularly when using generalized linear models or likelihood-based approaches for preprocessing and downstream analysis.
first_indexed	2024-12-23T19:41:57Z
format	Article
id	doaj.art-609c7e20f7084346bc7c38f55ee9adcb
institution	Directory Open Access Journal
issn	1474-760X
language	English
last_indexed	2024-12-23T19:41:57Z
publishDate	2022-01-01
publisher	BMC
record_format	Article
series	Genome Biology
spelling	doaj.art-609c7e20f7084346bc7c38f55ee9adcb; 2022-12-21T17:33:38Z; eng; BMC; Genome Biology; 1474-760X; 2022-01-01; 23; 1; 1; 20; 10.1186/s13059-021-02584-9; Comparison and evaluation of statistical error models for scRNA-seq; Saket Choudhary (New York Genome Center); Rahul Satija (New York Genome Center); https://doi.org/10.1186/s13059-021-02584-9
title	Comparison and evaluation of statistical error models for scRNA-seq
topic	Single-cell RNA-seq, Normalization, Dimension reduction, Variable genes, Differential expression, Feature selection
url	https://doi.org/10.1186/s13059-021-02584-9

Similar Items