Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data

Abstract Background In differential expression analysis of RNA-sequencing (RNA-seq) read count data for two sample groups, it is known that highly expressed genes (or longer genes) are more likely to be differentially expressed, which is called read count bias (or gene length bias). This bias had gre...


Bibliographic Details
Main Authors:	Sora Yoon, Dougu Nam
Format:	Article
Language:	English
Published:	BMC 2017-05-01
Series:	BMC Genomics
Subjects:	RNA-seq Differential expression analysis Read count bias Gene length bias Dispersion
Online Access:	http://link.springer.com/article/10.1186/s12864-017-3809-0

author	Sora Yoon Dougu Nam
collection	DOAJ
description	Abstract Background: In differential expression analysis of RNA-sequencing (RNA-seq) read count data for two sample groups, it is known that highly expressed genes (or longer genes) are more likely to be differentially expressed, which is called read count bias (or gene length bias). This bias had a great effect on the downstream Gene Ontology over-representation analysis. However, such a bias has not been systematically analyzed for different replicate types of RNA-seq data. Results: We show, by mathematical inference and by tests on a number of simulated and real RNA-seq datasets, that the dispersion coefficient of a gene in the negative binomial modeling of read counts is the critical determinant of the read count bias (and gene length bias). We demonstrate that the read count bias is mostly confined to data with small gene dispersions (e.g., technical replicates and some genetically identical replicates such as cell lines or inbred animals), and that many biological replicate datasets from unrelated samples do not suffer from such a bias, except for genes with small counts. We also show that the sample-permuting GSEA method yields a considerable number of false positives caused by the read count bias, while the preranked method does not. Conclusion: We show for the first time that small gene variance (similarly, dispersion) is the main cause of read count bias (and gene length bias), and we analyze the read count bias for different replicate types of RNA-seq data and its effect on gene-set enrichment analysis.
first_indexed	2024-04-13T07:41:09Z
format	Article
id	doaj.art-4b5d2da50d744c7bad16ef3c30bdbdc1
institution	Directory Open Access Journal
issn	1471-2164
language	English
last_indexed	2024-04-13T07:41:09Z
publishDate	2017-05-01
publisher	BMC
record_format	Article
series	BMC Genomics
spelling	doaj.art-4b5d2da50d744c7bad16ef3c30bdbdc1; 2022-12-22T02:55:54Z; eng; BMC; BMC Genomics; 1471-2164; 2017-05-01; 181111; 10.1186/s12864-017-3809-0; Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data; Sora Yoon, Dougu Nam (both: School of Life Sciences, Ulsan National Institute of Science and Technology); http://link.springer.com/article/10.1186/s12864-017-3809-0; RNA-seq; Differential expression analysis; Read count bias; Gene length bias; Dispersion
title	Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data
topic	RNA-seq Differential expression analysis Read count bias Gene length bias Dispersion
url	http://link.springer.com/article/10.1186/s12864-017-3809-0


Similar Items