Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data

Abstract. Background: Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use...

Bibliographic Details
Main Authors:	Jan Lause, Philipp Berens, Dmitry Kobak
Format:	Article
Language:	English
Published:	BMC 2021-09-01
Series:	Genome Biology
Online Access:	https://doi.org/10.1186/s13059-021-02451-7

collection	DOAJ
description	Background: Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (Genome Biol 20:296, 2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (Genome Biol 20:295, 2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing. Results: We show that the model of Hafemeister and Satija produces noisy parameter estimates because it is overspecified, which is why the original paper employs post hoc smoothing. When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija are biased, and that the data are in fact consistent with the overdispersion parameter being independent of gene expression. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols, the data are close to Poisson and suggest very moderate overdispersion. Finally, we perform a benchmark to compare the performance of Pearson residuals, variance-stabilizing transformations, and GLM-PCA on scRNA-seq datasets with known ground truth. Conclusions: We demonstrate that analytic Pearson residuals strongly outperform other methods for identifying biologically variable genes, and capture more of the biologically meaningful variation when used for dimensionality reduction.
first_indexed	2024-12-13T22:00:07Z
id	doaj.art-b0c1779c0b134cc88809ae17e2c25707
institution	Directory Open Access Journal
issn	1474-760X
spelling	doaj.art-b0c1779c0b134cc88809ae17e2c25707 | 2022-12-21T23:30:02Z | eng | BMC | Genome Biology | 1474-760X | 2021-09-01 | vol. 22, iss. 1, pp. 1-20 | 10.1186/s13059-021-02451-7 | Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data | Jan Lause; Philipp Berens; Dmitry Kobak | University of Tübingen, Institute for Ophthalmic Research (listed for each author) | https://doi.org/10.1186/s13059-021-02451-7

Similar Items