FASTA Herder: a web application to trim protein sequence sets

Bibliographic Details
Format:	Article
Language:	English
Published:	ScienceOpen 2014-03-01
Series:	ScienceOpen Research
Online Access:	https://www.scienceopen.com/document?vid=c769ab9d-1a2f-4b32-8a91-9eaec8e6f677

collection	DOAJ
description	The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. While homology information can be very useful for functional prediction based on amino acid conservation, many of these homologs share high levels of identity with one another, which hinders multiple sequence alignment (MSA) computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled, producing a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near-full-length homologs, allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://www.ogic.ca/projects/fh/orain.html .
first_indexed	2024-04-10T07:50:14Z
format	Article
id	doaj.art-bc95a25ee6a44b8c9cfd88765afa2cff
institution	Directory Open Access Journal
issn	2199-1006
language	English
last_indexed	2025-03-20T06:13:06Z
publishDate	2014-03-01
publisher	ScienceOpen
record_format	Article
series	ScienceOpen Research
spelling	doaj.art-bc95a25ee6a44b8c9cfd88765afa2cff 2024-10-02T18:26:16Z eng ScienceOpen ScienceOpen Research 2199-1006 2014-03-01 10.14293/S2199-1006.1.SOR-LIFE.A67837.v1 FASTA Herder: a web application to trim protein sequence sets
title	FASTA Herder: a web application to trim protein sequence sets
url	https://www.scienceopen.com/document?vid=c769ab9d-1a2f-4b32-8a91-9eaec8e6f677
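
The description in this record outlines the idea of culling redundant homologs by clustering near-full-length sequences at a chosen identity threshold. The sketch below illustrates that general idea only; it is not the FASTA Herder algorithm, whose details are not given in this record. It assumes a simple greedy scheme, uses Python's difflib.SequenceMatcher ratio as a crude stand-in for alignment-based identity, and introduces hypothetical names and defaults (parse_fasta, approx_identity, trim_redundant, min_identity, min_length_ratio) purely for illustration.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def approx_identity(a, b):
    """Crude similarity in [0, 1]; an assumed stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def trim_redundant(records, min_identity=0.7, min_length_ratio=0.9):
    """Greedily keep one representative per group of similar, near-full-length sequences.

    Both thresholds are hypothetical defaults, not values from the article.
    """
    representatives = []
    # Process longest sequences first so each group is represented by its longest member.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        redundant = False
        for _, rep_seq in representatives:
            # Require the candidate to cover most of the representative's length,
            # so short fragments are not merged into a full-length cluster.
            length_ratio = len(seq) / len(rep_seq) if rep_seq else 0.0
            if length_ratio >= min_length_ratio and approx_identity(seq, rep_seq) >= min_identity:
                redundant = True
                break
        if not redundant:
            representatives.append((header, seq))
    return representatives
```

Calling trim_redundant(parse_fasta(text)) on the contents of a FASTA file returns the retained (header, sequence) pairs. A real tool would replace approx_identity with identity computed from a proper pairwise alignment.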

Similar Items