Mapping accuracy of short reads from massively parallel sequencing and the implications for quantitative expression profiling.

BACKGROUND: Massively parallel sequencing offers enormous potential for expression profiling, in particular for interspecific comparisons. Currently, different platforms for massively parallel sequencing are available, which differ in read length and sequencing costs. The 454-technology offers the...


Bibliographic Details
Main Authors: Nicola Palmieri, Christian Schlötterer
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2009-07-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC2712089?pdf=render
_version_ 1811330488111988736
author Nicola Palmieri
Christian Schlötterer
collection DOAJ
description BACKGROUND: Massively parallel sequencing offers enormous potential for expression profiling, in particular for interspecific comparisons. Currently, different platforms for massively parallel sequencing are available, which differ in read length and sequencing costs. The 454-technology offers the highest read length. The other sequencing technologies are more cost effective, at the expense of shorter reads. Reliable expression profiling by massively parallel sequencing depends crucially on the accuracy with which the reads can be mapped to the corresponding genes.
METHODOLOGY/PRINCIPAL FINDINGS: We performed an in silico analysis to evaluate whether incorrect mapping of the sequence reads results in a biased expression pattern. A comparison of six available mapping software tools indicated considerable heterogeneity in mapping speed and accuracy. Regardless of the software used to map the reads, we found that for compact genomes both short (35 bp, 50 bp) and long sequence reads (100 bp) result in an almost unbiased expression pattern. In contrast, for species with a larger genome containing more gene families and repetitive DNA, shorter reads (35-50 bp) produced a considerable bias in gene expression. In humans, about 10% of the genes had fewer than 50% of the sequence reads correctly mapped. Sequence polymorphism of up to 9% had almost no effect on the mapping accuracy of 100 bp reads. For 35 bp reads, sequence divergence of up to 3% did not strongly affect the mapping accuracy. The effect of indels on the mapping efficiency depends strongly on the mapping software.
CONCLUSIONS/SIGNIFICANCE: In complex genomes, expression profiling by massively parallel sequencing can introduce a considerable bias due to incorrectly mapped sequence reads if the read length is short. Nevertheless, this bias can be accounted for if the genomic sequence is known. Furthermore, sequence polymorphisms and indels also affect the mapping accuracy and may cause a biased gene expression measurement. The choice of mapping software is critical, and its reliability depends on the presence or absence of indels and on the divergence between the reads and the reference genome. Overall, we found SSAHA2 and CLC to produce the most reliable mapping results.
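The per-gene measure quoted in the abstract (the share of a gene's reads that map back to that gene, and the share of genes falling below 50%) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the `(true_gene, mapped_gene)` input layout, and the toy data are all hypothetical, and the sketch assumes that simulated reads carry a known gene of origin.

```python
from collections import Counter


def per_gene_accuracy(read_assignments):
    """Fraction of each gene's simulated reads mapped back to that gene.

    read_assignments: list of (true_gene, mapped_gene) pairs, where
    mapped_gene is None for an unmapped read (hypothetical layout).
    """
    total = Counter()
    correct = Counter()
    for true_gene, mapped_gene in read_assignments:
        total[true_gene] += 1
        if mapped_gene == true_gene:
            correct[true_gene] += 1
    return {gene: correct[gene] / total[gene] for gene in total}


def expression_ratio(read_assignments):
    """Observed read count per gene divided by its true read count.

    A ratio of 1.0 means the gene's apparent expression is unbiased;
    values above 1.0 mean the gene attracted reads from other genes.
    """
    true_counts = Counter(true for true, _ in read_assignments)
    observed = Counter(m for _, m in read_assignments if m is not None)
    return {gene: observed[gene] / true_counts[gene] for gene in true_counts}


def share_of_genes_below(accuracy, threshold=0.5):
    """Share of genes whose mapping accuracy is below the threshold."""
    if not accuracy:
        return 0.0
    below = sum(1 for value in accuracy.values() if value < threshold)
    return below / len(accuracy)


# Toy data (invented for illustration): geneC resembles geneB,
# so some of its reads are misassigned or left unmapped.
reads = [
    ("geneA", "geneA"), ("geneA", "geneA"), ("geneA", "geneA"), ("geneA", "geneB"),
    ("geneB", "geneB"), ("geneB", "geneB"),
    ("geneC", "geneB"), ("geneC", None), ("geneC", "geneC"),
]

accuracy = per_gene_accuracy(reads)
ratios = expression_ratio(reads)
share_below = share_of_genes_below(accuracy)
```

In this toy set, geneB appears twice as highly expressed as it truly is because it absorbs reads from geneA and geneC, which is the kind of bias the abstract attributes to short reads in genomes rich in gene families.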
first_indexed 2024-04-13T16:02:32Z
format Article
id doaj.art-dbae2f1edcfd4b80bb15927b7fff5726
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-04-13T16:02:32Z
publishDate 2009-07-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-dbae2f1edcfd4b80bb15927b7fff5726 2022-12-22T02:40:30Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2009-07-01 4 7 e6323 10.1371/journal.pone.0006323 Mapping accuracy of short reads from massively parallel sequencing and the implications for quantitative expression profiling. Nicola Palmieri Christian Schlötterer http://europepmc.org/articles/PMC2712089?pdf=render
title Mapping accuracy of short reads from massively parallel sequencing and the implications for quantitative expression profiling.
url http://europepmc.org/articles/PMC2712089?pdf=render