Fitting hidden Markov models of protein domains to a target species: application to Plasmodium falciparum

Abstract. Background: Hidden Markov Models (HMMs) are a powerful tool for protein domain identification. The Pfam database notably provides a large collection of HMMs, which are widely used to annotate proteins in newly sequenced organisms. In Pfam...

Bibliographic Details
Main Authors:	Terrapon Nicolas, Gascuel Olivier, Maréchal Éric, Bréhélin Laurent
Format:	Article
Language:	English
Published:	BMC 2012-05-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/13/67

_version_	1819137312460636160
author	Terrapon Nicolas Gascuel Olivier Maréchal Éric Bréhélin Laurent
author_facet	Terrapon Nicolas Gascuel Olivier Maréchal Éric Bréhélin Laurent
author_sort	Terrapon Nicolas
collection	DOAJ
description	<p>Abstract</p> <p>Background</p> <p>Hidden Markov Models (HMMs) are a powerful tool for protein domain identification. The Pfam database notably provides a large collection of HMMs, which are widely used to annotate proteins in newly sequenced organisms. In Pfam, each domain family is represented by a curated multiple sequence alignment from which a profile HMM is built. In spite of their high specificity, HMMs may lack sensitivity when searching for domains in divergent organisms. This is particularly the case for species with a biased amino-acid composition, such as <it>P. falciparum</it>, the main causal agent of human malaria. In this context, fitting HMMs to the specificities of the target proteome can help identify additional domains.</p> <p>Results</p> <p>Using <it>P. falciparum</it> as an example, we compare approaches that have been proposed for this problem, and present two alternative methods. Because previous attempts rely strongly on known domain occurrences in the target species or its close relatives, they mainly improve the detection of domains that belong to already identified families. Our methods learn global correction rules that adjust the amino-acid distributions associated with the match states of HMMs. These rules are applied to all match states of the whole HMM library, thus enabling the detection of domains from previously absent families. Additionally, we propose a procedure to estimate the proportion of false positives among the newly discovered domains. Starting with the Pfam standard library, we build several new libraries with the different HMM-fitting approaches. These libraries are first used to detect new domain occurrences with low E-values. Second, by applying the Co-Occurrence Domain Discovery (CODD) procedure we have recently proposed, the libraries are further used to identify likely occurrences among potential domains with higher E-values.</p> <p>Conclusion</p> <p>We show that the new approaches allow identification of several domain families previously absent in the <it>P. falciparum</it> proteome and the Apicomplexa phylum, and identify many domains that are not detected by previous approaches. In terms of the number of newly discovered domains, the new approaches outperform the previous ones when no closely related species are available or when they are used to identify likely occurrences among potential domains with high E-values. All predictions on <it>P. falciparum</it> have been integrated into a dedicated website which pools all known and new annotations of protein domains and functions for this organism. Software implementing the two proposed approaches is available at the same address: <url>http://www.lirmm.fr/~terrapon/HMMfit/</url></p>
first_indexed	2024-12-22T10:48:52Z
format	Article
id	doaj.art-2e5ed8fe06914947bfd567ae5c792483
institution	Directory Open Access Journal
issn	1471-2105
language	English
last_indexed	2024-12-22T10:48:52Z
publishDate	2012-05-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj.art-2e5ed8fe06914947bfd567ae5c792483 2022-12-21T18:28:50Z eng BMC BMC Bioinformatics 1471-2105 2012-05-01 13 1 67 10.1186/1471-2105-13-67 Fitting hidden Markov models of protein domains to a target species: application to <it>Plasmodium falciparum</it> Terrapon Nicolas Gascuel Olivier Maréchal Éric Bréhélin Laurent http://www.biomedcentral.com/1471-2105/13/67
spellingShingle	Terrapon Nicolas Gascuel Olivier Maréchal Éric Bréhélin Laurent Fitting hidden Markov models of protein domains to a target species: application to <it>Plasmodium falciparum</it> BMC Bioinformatics
title	Fitting hidden Markov models of protein domains to a target species: application to <it>Plasmodium falciparum</it>
title_sort	fitting hidden markov models of protein domains to a target species application to it plasmodium falciparum it
url	http://www.biomedcentral.com/1471-2105/13/67
work_keys_str_mv	AT terraponnicolas fittinghiddenmarkovmodelsofproteindomainstoatargetspeciesapplicationtoitplasmodiumfalciparumit AT gascuelolivier fittinghiddenmarkovmodelsofproteindomainstoatargetspeciesapplicationtoitplasmodiumfalciparumit AT marechaleric fittinghiddenmarkovmodelsofproteindomainstoatargetspeciesapplicationtoitplasmodiumfalciparumit AT brehelinlaurent fittinghiddenmarkovmodelsofproteindomainstoatargetspeciesapplicationtoitplasmodiumfalciparumit
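
The description above mentions global correction rules that adjust the amino-acid distributions of HMM match states and are applied across a whole library. The sketch below is only an illustration of that general idea, not the authors' method: it assumes a simple odds-ratio reweighting (multiply each emission probability by the ratio of target to reference background frequency, then renormalize). All names here (`background_frequencies`, `refit_match_state`, `refit_library`) and the dictionary-based data layout are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def background_frequencies(sequences):
    """Estimate amino-acid frequencies from an iterable of protein strings."""
    # Add-one pseudocounts keep every frequency strictly positive.
    counts = {aa: 1.0 for aa in AMINO_ACIDS}
    for seq in sequences:
        for aa in seq:
            if aa in counts:  # ignore non-standard residue symbols
                counts[aa] += 1.0
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def refit_match_state(emission, ref_bg, target_bg):
    """Reweight one match-state distribution by target_bg / ref_bg.

    Assumption: a plain odds-ratio correction, used here for illustration
    only; the published approaches learn their correction rules from data.
    """
    scaled = {aa: emission[aa] * target_bg[aa] / ref_bg[aa] for aa in AMINO_ACIDS}
    total = sum(scaled.values())
    return {aa: p / total for aa, p in scaled.items()}


def refit_library(library, ref_bg, target_bg):
    """Apply the same correction to every match state of every model.

    `library` maps a model name to a list of match-state distributions,
    each a dict from amino acid to probability (hypothetical layout).
    """
    return {
        name: [refit_match_state(state, ref_bg, target_bg) for state in states]
        for name, states in library.items()
    }


if __name__ == "__main__":
    uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    # Toy "proteome" enriched in N and K, standing in for a biased composition.
    target = background_frequencies(["NNNNKKKKAAA"])
    fitted = refit_library({"toy_model": [uniform, uniform]}, uniform, target)
    print(round(fitted["toy_model"][0]["N"], 3), round(fitted["toy_model"][0]["W"], 3))
```

Because the same rule is applied to every match state, families with no known occurrence in the target species are adjusted as well, which is the property the description attributes to the global approaches.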

Similar Items