Gene identification using phylogenetic metrics with conditional random fields

Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.

Bibliographic Details
Main Author: Deoras, Ameya Nitin
Other Authors: Manolis Kellis.
Format: Thesis
Language: eng
Published: Massachusetts Institute of Technology 2008
Subjects: Electrical Engineering and Computer Science.
Online Access: http://hdl.handle.net/1721.1/40533
Abstract:

While the complete sequence of the human genome contains all the information necessary for encoding a complete human being, its interpretation remains a major challenge of modern biology. The first step to any genomic analysis is a comprehensive and accurate annotation of all genes encoded in the genome, providing the basis for understanding human variation, gene regulation, health and disease. Traditionally, the problem of computational gene prediction has been addressed using graphical probabilistic models of genomic sequence. While such models have been successful for small genomes with relatively simple gene structure, new methods are necessary for scaling these to the complete human genome, and for leveraging information across multiple mammalian species currently being sequenced. While generative models like hidden Markov models (HMMs) face the difficulty of modeling both coding and non-coding regions across a complete genome, discriminative models such as Conditional Random Fields (CRFs) have recently emerged, which focus specifically on the discrimination problem of gene identification, and can therefore be more powerful. One of the most attractive characteristics of these models is that their general framework also allows the incorporation of any number of independently derived feature functions (metrics), which can increase discriminatory power.

While most of the work on CRFs for gene finding has been on model construction and training, there has not been much focus on the metrics used in such discriminatory frameworks. This is particularly important with the availability of rich comparative genome data, enabling the development of phylogenetic gene identification metrics which can maximally use alignments of a large number of genomes.

In this work I address the question of gene identification using multiple related genomes. I first present novel comparative metrics for gene classification that show considerable improvement over existing work, and also scale well with an increase in the number of aligned genomes. Second, I describe a general methodology of extending pair-wise metrics to alignments of multiple genomes that incorporates the evolutionary phylogenetic relationship between informant species. Third, I evaluate various methods of combining metrics that exploit metric independence and result in superior classification. Finally, I incorporate the metrics into a Conditional Random Field gene model, to perform unrestricted de novo gene prediction on 12-species alignments of the D. melanogaster genome, and demonstrate accuracy rivaling that of state-of-the-art gene prediction systems.

by Ameya Nitin Deoras. S.M.
Notes: Includes bibliographical references (p. 69-72).
Extent: 78 p., application/pdf
Identifier: 191957958
Dates: 2007; 2008-02-27T22:44:26Z
Rights: M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582