Identifying protein-coding genes and synonymous constraint elements using phylogenetic codon models

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012.

Bibliographic Details
Main Author: Lin, Michael F. (Michael Fong-Jay)
Other Authors: Manolis Kellis.
Format: Thesis
Language: eng
Published: Massachusetts Institute of Technology 2012
Subjects: Electrical Engineering and Computer Science.
Online Access: http://hdl.handle.net/1721.1/71480
Notes: Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 93-105).

Abstract:
We develop novel methods for comparative genomics analysis of protein-coding genes using phylogenetic codon models, in pursuit of two main lines of biological investigation.

First, we develop PhyloCSF, an algorithm based on empirical phylogenetic codon models to distinguish protein-coding and non-coding regions in multi-species genome alignments. We benchmark PhyloCSF to show that it outperforms other methods, and we apply it to discover novel genes and analyze existing gene annotations in the human, mouse, zebrafish, fruitfly and fungal genomes. We use our predictions to revise the canonical annotations of these genomes in collaboration with GENCODE, FlyBase and other curators. We also reveal a surprisingly widespread mechanism of stop codon readthrough in the fruitfly genome, with additional examples found in mammals. Our work contributes to more-complete gene catalogs and sheds light on fascinating unusual gene structures in the human and other eukaryotic genomes.

Second, we design phylogenetic codon models to detect evolutionary constraint at synonymous sites of mammalian genes. These sites are frequently assumed to evolve neutrally, but increased conservation would suggest they encode additional information overlapping the protein-coding sequence. We produce the first high-resolution catalog of individual human coding regions showing highly conserved synonymous sites across mammals, which we call Synonymous Constraint Elements (SCEs). We locate more than 10,000 SCEs, covering ~2% of synonymous sites, and found within over one-quarter of all human genes. We present evidence that they indeed encode numerous overlapping biological functions, including splicing- and translation-associated regulatory motifs, microRNA target sites, RNA secondary structures, dual-coding genes, and developmental enhancers. We also develop a lineage-specific test which we use to study the evolutionary history of SCEs, and a Bayesian framework that further increases the resolution with which we can identify them. Our methods and datasets can inform future studies on mammalian gene structures, human disease associations, and personal genome interpretation.

by Michael F. Lin. Ph.D.

Physical Description: 105 p. application/pdf
Record: mit-1721.1/71480 (2019-04-10T15:57:42Z)
Dates: 2012-07-02T15:46:34Z; 2012
Identifier: 795569070
Rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582