Identifying protein-coding genes and synonymous constraint elements using phylogenetic codon models
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012.
Main Author: | Lin, Michael F. (Michael Fong-Jay) |
---|---|
Other Authors: | Manolis Kellis. |
Format: | Thesis |
Language: | eng |
Published: | Massachusetts Institute of Technology, 2012 |
Subjects: | Electrical Engineering and Computer Science. |
Online Access: | http://hdl.handle.net/1721.1/71480 |
_version_ | 1811079471694872576 |
---|---|
author | Lin, Michael F. (Michael Fong-Jay) |
author2 | Manolis Kellis. |
author_facet | Manolis Kellis. Lin, Michael F. (Michael Fong-Jay) |
author_sort | Lin, Michael F. (Michael Fong-Jay) |
collection | MIT |
description | Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. |
first_indexed | 2024-09-23T11:15:33Z |
format | Thesis |
id | mit-1721.1/71480 |
institution | Massachusetts Institute of Technology |
language | eng |
last_indexed | 2024-09-23T11:15:33Z |
publishDate | 2012 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/714802019-04-10T15:57:42Z Identifying protein-coding genes and synonymous constraint elements using phylogenetic codon models Lin, Michael F. (Michael Fong-Jay) Manolis Kellis. Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Electrical Engineering and Computer Science. Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 93-105). We develop novel methods for comparative genomics analysis of protein-coding genes using phylogenetic codon models, in pursuit of two main lines of biological investigation: First, we develop PhyloCSF, an algorithm based on empirical phylogenetic codon models to distinguish protein-coding and non-coding regions in multi-species genome alignments. We benchmark PhyloCSF to show that it outperforms other methods, and we apply it to discover novel genes and analyze existing gene annotations in the human, mouse, zebrafish, fruitfly and fungal genomes. We use our predictions to revise the canonical annotations of these genomes in collaboration with GENCODE, FlyBase and other curators. We also reveal a surprisingly widespread mechanism of stop codon readthrough in the fruitfly genome, with additional examples found in mammals. Our work contributes to more-complete gene catalogs and sheds light on fascinating unusual gene structures in the human and other eukaryotic genomes. Second, we design phylogenetic codon models to detect evolutionary constraint at synonymous sites of mammalian genes. These sites are frequently assumed to evolve neutrally, but increased conservation would suggest they encode additional information overlapping the protein-coding sequence. 
We produce the first high-resolution catalog of individual human coding regions showing highly conserved synonymous sites across mammals, which we call Synonymous Constraint Elements (SCEs). We locate more than 10,000 SCEs, covering ~2% of synonymous sites, and found within over one-quarter of all human genes. We present evidence that they indeed encode numerous overlapping biological functions, including splicing- and translation-associated regulatory motifs, microRNA target sites, RNA secondary structures, dual-coding genes, and developmental enhancers. We also develop a lineage-specific test which we use to study the evolutionary history of SCEs, and a Bayesian framework that further increases the resolution with which we can identify them. Our methods and datasets can inform future studies on mammalian gene structures, human disease associations, and personal genome interpretation. by Michael F. Lin. Ph.D. 2012-07-02T15:46:34Z 2012-07-02T15:46:34Z 2012 2012 Thesis http://hdl.handle.net/1721.1/71480 795569070 eng M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582 105 p. application/pdf Massachusetts Institute of Technology |
spellingShingle | Electrical Engineering and Computer Science. Lin, Michael F. (Michael Fong-Jay) Identifying protein-coding genes and synonymous constraint elements using phylogenetic codon models |
title | Identifying protein-coding genes and synonymous constraint elements using phylogenetic codon models |
title_full | Identifying protein-coding genes and synonymous constraint elements using phylogenetic codon models |
title_fullStr | Identifying protein-coding genes and synonymous constraint elements using phylogenetic codon models |
title_full_unstemmed | Identifying protein-coding genes and synonymous constraint elements using phylogenetic codon models |
title_short | Identifying protein-coding genes and synonymous constraint elements using phylogenetic codon models |
title_sort | identifying protein coding genes and synonymous constraint elements using phylogenetic codon models |
topic | Electrical Engineering and Computer Science. |
url | http://hdl.handle.net/1721.1/71480 |
work_keys_str_mv | AT linmichaelfmichaelfongjay identifyingproteincodinggenesandsynonymousconstraintelementsusingphylogeneticcodonmodels |