FASTA Herder: a web application to trim protein sequence sets

Bibliographic Details
Format: Article
Language: English
Published: ScienceOpen 2014-03-01
Series: ScienceOpen Research
Online Access: https://www.scienceopen.com/document?vid=c769ab9d-1a2f-4b32-8a91-9eaec8e6f677
collection DOAJ
description The ever-increasing number of sequences in protein databases means that sequence similarity searches typically return large numbers of homologs. While homology information can be very useful for functional prediction based on amino acid conservation, many of these homologs share high levels of identity with one another, which hinders multiple sequence alignment (MSA) computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled to produce a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near-full-length homologs, allowing for lower sequence identity thresholds. The method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://www.ogic.ca/projects/fh/orain.html .
first_indexed 2024-04-10T07:50:14Z
format Article
id doaj.art-bc95a25ee6a44b8c9cfd88765afa2cff
institution Directory Open Access Journal
issn 2199-1006
language English
last_indexed 2025-03-20T06:13:06Z
publishDate 2014-03-01
publisher ScienceOpen
record_format Article
series ScienceOpen Research
spelling doaj.art-bc95a25ee6a44b8c9cfd88765afa2cff; 2024-10-02T18:26:16Z; eng; ScienceOpen; ScienceOpen Research; 2199-1006; 2014-03-01; 10.14293/S2199-1006.1.SOR-LIFE.A67837.v1; FASTA Herder: a web application to trim protein sequence sets; https://www.scienceopen.com/document?vid=c769ab9d-1a2f-4b32-8a91-9eaec8e6f677
title FASTA Herder: a web application to trim protein sequence sets
url https://www.scienceopen.com/document?vid=c769ab9d-1a2f-4b32-8a91-9eaec8e6f677