Multiple sequence analysis in the presence of alignment uncertainty

Bibliographic Details
Main Author: Herman, J (Joseph Herman)
Other Authors: Hein, J
Format: Thesis
Language: English
Published: 2014
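Subjects: Probability; Bioinformatics (life sciences); Stochastic processes; Structural genomics; Mathematical genetics and bioinformatics (statistics)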
author Herman, J
Joseph Herman
author2 Hein, J
collection OXFORD
description <p>Sequence alignment is one of the most intensely studied problems in bioinformatics and an important step in a wide range of analyses. An issue that has gained much attention in recent years is that downstream analyses are often highly sensitive to the specific choice of alignment.</p> <p>One way to address this is to sample alignments jointly with the other parameters of interest. To extend the range of applicability of this approach, the first chapter of this thesis introduces a probabilistic evolutionary model for protein structures on a phylogenetic tree. Because protein structures typically diverge much more slowly than sequences, this allows more reliable detection of remote homologies, improving the accuracy of the resulting alignments and trees and reducing the sensitivity of the results to the choice of dataset. To carry out inference under such a model, a number of new Markov chain Monte Carlo approaches are developed, allowing more efficient convergence and mixing on the high-dimensional parameter space.</p> <p>The second part of the thesis presents a directed acyclic graph (DAG)-based approach for representing a collection of sampled alignments. The DAG representation allows the initial collection of samples to be used to generate a larger set of alignments under approximately the same distribution, so that posterior alignment probabilities can be estimated reliably from a reasonable number of samples. If desired, summary alignments can then be generated as maximum-weight paths through the DAG under various loss or scoring functions.</p> <p>Because the graph is acyclic, various other algorithms can also be adapted easily to operate on the entire set of alignments in the DAG. The final part of this work introduces methodology for alignment-DAG-based sequence annotation using hidden Markov models and for RNA secondary structure prediction using stochastic context-free grammars. Results on test datasets indicate that the additional information contained within the DAG improves predictions, giving substantial gains over simply analysing a set of alignments one by one.</p>
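The alignment-DAG idea summarised in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the thesis's implementation: the function names (`columns`, `build_dag`, `max_weight_path`, `count_paths`, `render`) are hypothetical, the column encoding (residues consumed so far in each row, plus a residue/gap flag) is simply one choice that keeps the graph acyclic, and the score is a plain sum of column frequencies rather than any particular loss function from the thesis. Input alignments are assumed to contain no all-gap columns.

```python
from collections import Counter, defaultdict


def columns(alignment):
    """Encode each column as a tuple of (residues_consumed, is_residue) per row."""
    consumed = [0] * len(alignment)
    cols = []
    for j in range(len(alignment[0])):
        col = []
        for i, row in enumerate(alignment):
            is_res = row[j] != "-"
            consumed[i] += is_res
            col.append((consumed[i], is_res))
        cols.append(tuple(col))
    return cols


def build_dag(samples):
    """Return (column weights, successor sets) for a list of sampled alignments."""
    weight = Counter()         # fraction of samples that contain each column
    succ = defaultdict(set)    # edges between consecutive columns
    for aln in samples:
        path = ["START"] + columns(aln) + ["END"]
        for a, b in zip(path, path[1:]):
            succ[a].add(b)
        for col in path[1:-1]:
            weight[col] += 1 / len(samples)
    return weight, succ


def topological_order(weight):
    # The total number of residues consumed strictly increases along every
    # edge, so sorting columns by that total gives a valid topological order.
    return ["START"] + sorted(weight, key=lambda col: sum(n for n, _ in col))


def max_weight_path(weight, succ):
    """Dynamic programming for the path with the largest summed column weight."""
    best = {"START": (0.0, None)}  # node -> (best score, predecessor)
    for node in topological_order(weight):
        score = best[node][0]
        for nxt in succ[node]:
            cand = score + weight.get(nxt, 0.0)
            if nxt not in best or cand > best[nxt][0]:
                best[nxt] = (cand, node)
    path, node = [], best["END"][1]
    while node != "START":
        path.append(node)
        node = best[node][1]
    return path[::-1]


def count_paths(weight, succ):
    """Number of distinct START-to-END paths, i.e. alignments encoded by the DAG."""
    ways = defaultdict(int)
    ways["START"] = 1
    for node in topological_order(weight):
        for nxt in succ[node]:
            ways[nxt] += ways[node]
    return ways["END"]


def render(path, seqs):
    """Turn a path of encoded columns back into gapped rows."""
    rows = []
    for i, seq in enumerate(seqs):
        chars = [seq[col[i][0] - 1] if col[i][1] else "-" for col in path]
        rows.append("".join(chars))
    return rows
```

In this sketch, two sampled alignments of `ACGTA` and `CGA` that differ on both sides of a shared `G`/`G` column yield a DAG with four paths, which is the sense in which a DAG built from a few samples can encode a larger set of alignments. A summary alignment is obtained by passing the output of `max_weight_path` to `render` together with the ungapped sequences.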
first_indexed 2024-03-07T00:57:42Z
format Thesis
id oxford-uuid:88a56d9f-a96e-48e3-b8dc-a73f3efc8472
institution University of Oxford
language English
last_indexed 2024-03-07T00:57:42Z
publishDate 2014
record_format dspace
title Multiple sequence analysis in the presence of alignment uncertainty
topic Probability
Bioinformatics (life sciences)
Stochastic processes
Structural genomics
Mathematical genetics and bioinformatics (statistics)