The Poisson distribution model fits UMI-based single-cell RNA-sequencing data

Bibliographic Details
Main Authors: Yue Pan, Justin T. Landis, Razia Moorad, Di Wu, J. S. Marron, Dirk P. Dittmer
Format: Article
Language: English
Published: BMC 2023-06-01
Series: BMC Bioinformatics
Online Access: https://doi.org/10.1186/s12859-023-05349-2
Abstract
Background: Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy due to a too crude aggregation at those two levels.
Results: We avoid the crude approximations entailed by such aggregation through proposing an independent Poisson distribution (IPD) particularly at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed or can only be found by careful parameter tuning using conventional methods.
Conclusions: This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters; (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson.
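The idea of an entry-wise Poisson model with departures from a homogeneous fit can be illustrated with a small sketch. This is not the scpoisson implementation and does not use its API. The abstract does not spell out the form of the homogeneous model or of the departure, so two choices below are assumptions made purely for illustration: the homogeneous rate for each entry is taken as (row total x column total) / grand total, and the departure is taken as the difference between Anscombe-style square-root transforms of the observed count and of that rate. All function names are hypothetical.

```python
import math


def poisson_zero_prob(lam):
    """Probability of observing a zero under Poisson(lam): exp(-lam)."""
    return math.exp(-lam)


def homogeneous_rates(counts):
    """Illustrative homogeneous model (an assumption, not the paper's
    definition): rate[i][j] = row_total[i] * col_total[j] / grand_total."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def anscombe(x):
    """Anscombe variance-stabilising transform for Poisson-like counts."""
    return 2.0 * math.sqrt(x + 3.0 / 8.0)


def departures(counts):
    """Entry-wise departure of observed counts from the homogeneous rates,
    measured on the Anscombe scale (assumed form, for illustration only)."""
    rates = homogeneous_rates(counts)
    return [
        [anscombe(x) - anscombe(lam) for x, lam in zip(count_row, rate_row)]
        for count_row, rate_row in zip(counts, rates)
    ]


if __name__ == "__main__":
    # A very small Poisson parameter makes zeros the expected outcome.
    print(f"P(X = 0 | lambda = 0.05) = {poisson_zero_prob(0.05):.4f}")

    # Toy cells-by-genes count matrix with two groups of cells.
    toy_counts = [
        [0, 0, 5, 1],
        [0, 1, 4, 0],
        [3, 0, 0, 0],
        [4, 0, 1, 0],
    ]
    for row in departures(toy_counts):
        print([round(d, 2) for d in row])
```

In this toy setting, a matrix that is exactly the product of its row and column effects yields departures of zero everywhere, so nonzero departures flag entries that a single homogeneous Poisson model does not explain, which is the kind of per-gene-per-cell signal a clustering step could then use.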
ISSN: 1471-2105
Collection: DOAJ
Institution: Directory Open Access Journal
Record ID: doaj.art-c1b4528d7e5549bc82488d53ea1255dc
Author Affiliations:
Yue Pan: Department of Biostatistics, University of North Carolina at Chapel Hill
Justin T. Landis: Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill
Razia Moorad: Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill
Di Wu: Department of Biostatistics, University of North Carolina at Chapel Hill
J. S. Marron: Department of Biostatistics, University of North Carolina at Chapel Hill
Dirk P. Dittmer: Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill
Citation: BMC Bioinformatics, vol. 24, no. 1, pp. 1-27 (2023-06-01)
Subjects: Single cell; RNA-seq; Poisson distribution; Data representation