Bias detection and correction in RNA-Sequencing data

Abstract. Background: High throughput sequencing technology provides us unprecedented opportunities to study transcriptome dynamics. Compared to microarray-based gene expression profiling, RNA-Seq has many advantages, such as high resolution, low backgrou...


Bibliographic Details
Main Authors:	Zhao Hongyu, Chung Lisa M, Zheng Wei
Format:	Article
Language:	English
Published:	BMC 2011-07-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/12/290

collection	DOAJ
description	Abstract. Background: High throughput sequencing technology provides us unprecedented opportunities to study transcriptome dynamics. Compared to microarray-based gene expression profiling, RNA-Seq has many advantages, such as high resolution, low background, and ability to identify novel transcripts. Moreover, for genes with multiple isoforms, expression of each isoform may be estimated from RNA-Seq data. Despite these advantages, recent work revealed that base level read counts from RNA-Seq data may not be randomly distributed and can be affected by local nucleotide composition. It was not clear though how the base level read count bias may affect gene level expression estimates. Results: In this paper, by using five published RNA-Seq data sets from different biological sources and with different data preprocessing schemes, we showed that commonly used estimates of gene expression levels from RNA-Seq data, such as reads per kilobase of gene length per million reads (RPKM), are biased in terms of gene length, GC content and dinucleotide frequencies. We directly examined the biases at the gene-level, and proposed a simple generalized-additive-model based approach to correct different sources of biases simultaneously. Compared to previously proposed base level correction methods, our method reduces bias in gene-level expression estimates more effectively. Conclusions: Our method identifies and corrects different sources of biases in gene-level expression measures from RNA-Seq data, and provides more accurate estimates of gene expression levels from RNA-Seq. This method should prove useful in meta-analysis of gene expression levels using different platforms or experimental protocols.
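The abstract defines RPKM in words and describes correcting several gene-level covariates at once with an additive model. The sketch below illustrates both ideas under stated assumptions and is not the authors' implementation: the function names `rpkm` and `additive_bias_correct` are hypothetical, and low-degree polynomial terms fitted by least squares stand in for the smooth terms of a generalized additive model.

```python
import numpy as np


def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene length per million mapped reads."""
    counts = np.asarray(counts, dtype=float)
    lengths_bp = np.asarray(lengths_bp, dtype=float)
    total_reads = counts.sum()  # assumption: library size = sum of gene counts
    return counts / (lengths_bp / 1e3) / (total_reads / 1e6)


def additive_bias_correct(log_expr, covariates, degree=3):
    """Remove additive trends in each covariate from log expression.

    Assumption: each covariate's effect is approximated by a polynomial
    of the given degree (a stand-in for a GAM smooth term). All covariates
    are fitted jointly, so their effects are corrected simultaneously.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    columns = [np.ones_like(log_expr)]  # intercept
    for x in covariates:
        x = np.asarray(x, dtype=float)
        z = (x - x.mean()) / x.std()  # standardize for numerical stability
        for d in range(1, degree + 1):
            columns.append(z ** d)
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, log_expr, rcond=None)
    fitted = design @ coef
    # Keep the residuals and restore the overall mean expression level.
    return log_expr - fitted + log_expr.mean()
```

A typical call would pass log-transformed RPKM values together with covariates such as log gene length and GC content, for example `additive_bias_correct(np.log2(values + 1), [np.log(lengths), gc_content])`, where the pseudocount of 1 is also an assumption of this sketch.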
id	doaj.art-c24ddf1f952941f4a4db217d7400ca00
institution	Directory Open Access Journal
issn	1471-2105
spelling	Zhao Hongyu, Chung Lisa M, Zheng Wei. Bias detection and correction in RNA-Sequencing data. BMC Bioinformatics (BMC, ISSN 1471-2105), 2011-07-01, volume 12, issue 1, article 290. doi:10.1186/1471-2105-12-290. http://www.biomedcentral.com/1471-2105/12/290

Similar Items