Scaffolding low quality genomes using orthologous protein sequences.


Bibliographic Details
Main Authors: Li, Y, Copley, R
Format: Journal article
Language: English
Published: Oxford University Press 2013
Description
Summary: <p><strong>Motivation:</strong> The ready availability of next-generation sequencing has led to a situation where it is easy to produce very fragmentary genome assemblies. We present a pipeline, SWiPS (Scaffolding With Protein Sequences), that uses orthologous proteins to improve low quality genome assemblies. The protein sequences are used as guides to scaffold existing contigs, while simultaneously allowing the gene structure to be predicted by homology.</p> <p><strong>Results:</strong> SWiPS does not depend on a high N50 or on whole proteins being encoded on a single contig. We tested our algorithm on simulated next-generation data from <em>Ciona intestinalis</em>, real next-generation data from <em>Drosophila melanogaster</em>, a complex genome assembly of <em>Homo sapiens</em> and the low coverage Sanger sequence assembly of <em>Callorhinchus milii</em>. The improvements in N50 are of the order of 20% for the <em>C. intestinalis</em> and <em>H. sapiens</em> assemblies, which is significant considering the large size of intergenic regions in these eukaryotes. Using the CEGMA pipeline to assess the gene space represented in the genome assemblies, the number of genes retrieved increased by &gt;110% for <em>C. milii</em> and from 20 to 40% for <em>C. intestinalis</em>. The scaffold error rates are low: 85–90% of scaffolds are fully correct, and &gt;95% of local contig joins are correct.</p>
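The summary reports its gains in terms of N50. For readers unfamiliar with the metric, the following is a minimal sketch of the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function names and the toy contig lengths are illustrative assumptions; they are not taken from SWiPS or from the article's data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Standard definition: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of the sum.
    """
    if not lengths:
        raise ValueError("n50() requires at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least 50% of the assembly
            return length


def relative_improvement(before, after):
    """Fractional change in N50 between two assemblies (0.2 means +20%)."""
    n50_before = n50(before)
    return (n50(after) - n50_before) / n50_before


# Toy example (made-up numbers): joining the contigs of length 5 and 6
# into one scaffold of length 11 raises the N50.
contigs = [2, 3, 4, 5, 6, 10]
scaffolds = [2, 3, 4, 11, 10]
print(n50(contigs), n50(scaffolds), relative_improvement(contigs, scaffolds))
```

Because N50 depends only on the length distribution, it says nothing about whether the joins are correct, which is why the summary also reports scaffold error rates separately.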