Genome annotation errors and how to fix them


Bibliographic details
Main author: Dunne, M
Other authors: Kelly, S
Format: Thesis
Language: English
Published: 2018
Description
Summary: Many inferences about the biological properties of an organism depend on the completeness and accuracy of its genome annotation. Advances in sequencing technologies and the associated decreased costs have brought whole genome sequencing projects into the reach of individual laboratories, precipitating a huge acceleration in the publication of draft genome assemblies and annotations. While genome assembly quality metrics have received substantial attention, adequate frameworks for quantifying and controlling errors in genome annotations are lacking, and thus the completeness and accuracy of published genome annotations are unknown. Moreover, genome annotations are frequently taken at face value by those using them, and any errors present are propagated in downstream inferences and analyses. Despite underpinning much of comparative genomic research, little attention has been paid to quantifying the extent of genome annotation inaccuracies, and the majority of attempts to systematically rectify such errors have relied either on manual input, which is impractical on a large scale, or on universally conserved gene sets, which only account for a small percentage of genes. The aim of the research described in this thesis is to provide methods to assess and rectify two main classes of genome annotation errors (missing genes and incorrect gene models) at a phylogenetically local level, by mutually improving genome annotations for sets of related species, in the absence of extrinsic experimental data. I introduce several non-extrinsic metrics for assessing genome annotation completeness and the accuracy of the gene models contained therein, and provide two self-contained methods that improve genome annotation accuracy. In summary, this thesis reveals that genome annotation errors are widespread, even in widely studied community-annotated genomes, and that many of these errors can be identified and corrected using automated phylogenetically local approaches.