Homology-based method for identification of protein repeats using statistical significance estimates.

Short protein repeats, frequently with a length between 20 and 40 residues, represent a significant fraction of known proteins. Many repeats appear to possess high amino acid substitution rates and thus recognition of repeat homologues is highly problematic. Even if the presence of a certain repeat...

Bibliographic Details
Main Authors: Andrade, M, Ponting, C, Gibson, T, Bork, P
Format: Journal article
Language: English
Published: 2000
Description: Short protein repeats, frequently with a length between 20 and 40 residues, represent a significant fraction of known proteins. Many repeats appear to possess high amino acid substitution rates and thus recognition of repeat homologues is highly problematic. Even if the presence of a certain repeat family is known, the exact locations and the number of repetitive units often cannot be determined using current methods. We have devised an iterative algorithm based on optimal and sub-optimal score distributions from profile analysis that estimates the significance of all repeats that are detected in a single sequence. This procedure allows the identification of homologues at alignment scores lower than the highest optimal alignment score for non-homologous sequences. The method has been used to investigate the occurrence of eleven families of repeats in Saccharomyces cerevisiae, Caenorhabditis elegans and Homo sapiens accounting for 1055, 2205 and 2320 repeats, respectively. For these examples, the method is both more sensitive and more selective than conventional homology search procedures. The method allowed the detection in the SwissProt database of more than 2000 previously unrecognised repeats belonging to the 11 families. In addition, the method was used to merge several repeat families that previously were supposed to be distinct, indicating common phylogenetic origins for these families.
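
The significance-estimation idea in the description can be illustrated with a small sketch. It is not the authors' procedure: it assumes, for illustration only, that profile scores against non-homologous sequences follow a Gumbel (extreme value) distribution fitted by the method of moments, and it omits the iterative part of the algorithm. The function names (`fit_gumbel`, `p_value`, `significant_repeats`), the E-value cutoff and the example scores are all hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(null_scores):
    """Method-of-moments fit of a Gumbel distribution (an assumed null model)."""
    mean = statistics.fmean(null_scores)
    std = statistics.stdev(null_scores)
    beta = std * math.sqrt(6) / math.pi   # scale parameter
    mu = mean - EULER_GAMMA * beta        # location parameter
    return mu, beta


def p_value(score, mu, beta):
    """Probability that a null score is at least `score` under the fitted Gumbel."""
    z = (score - mu) / beta
    if z < -30:                           # avoid overflow; the tail is 1 here
        return 1.0
    return -math.expm1(-math.exp(-z))     # 1 - exp(-exp(-z))


def significant_repeats(candidates, null_scores, e_cutoff=0.01):
    """Keep candidate repeats whose E-value is at or below the cutoff.

    candidates: list of (start_position, alignment_score) for optimal and
    sub-optimal profile matches found in one sequence (hypothetical input).
    """
    mu, beta = fit_gumbel(null_scores)
    n_tests = len(candidates)             # simple multiple-testing correction
    kept = []
    for start, score in candidates:
        e_value = n_tests * p_value(score, mu, beta)
        if e_value <= e_cutoff:
            kept.append((start, score, e_value))
    return kept


# Made-up scores: null scores from unrelated sequences, two candidate matches.
null = [10, 12, 11, 13, 9, 10, 12, 11]
hits = significant_repeats([(5, 30.0), (40, 11.0)], null)
```

With these made-up numbers only the candidate at position 5 passes the cutoff. The candidate scoring 11.0 is indistinguishable from the null scores and is rejected.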
Identifier: oxford-uuid:df376586-2fe3-484b-9cdd-8e07ad0be80c
Institution: University of Oxford