FASTA Herder: a web application to trim protein sequence sets

Abstract: The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs usually have high...

Bibliographic Details
Main Authors:	Miguel Andrade, Caroline Louis-Jeune, Carol Perez-Iratxeta
Format:	Article
Language:	English
Published:	ScienceOpen 2015-08-01
Series:	ScienceOpen Research
Online Access:	https://www.scienceopen.com/document?vid=5df5dc75-0b14-497d-804d-0075d0201d15

_version_	1797896885973811200
author	Miguel Andrade Caroline Louis-Jeune Carol Perez-Iratxeta
collection	DOAJ
description	Abstract: The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs usually have high levels of identity among themselves, which hinders multiple sequence alignment computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled to produce a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near-full-length homologs, allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://fh.ogic.ca/.
first_indexed	2024-04-10T07:49:54Z
format	Article
id	doaj.art-70aef44adee341afb79c5ce59210a6f3
institution	Directory Open Access Journal
issn	2199-1006
language	English
last_indexed	2024-04-10T07:49:54Z
publishDate	2015-08-01
publisher	ScienceOpen
record_format	Article
series	ScienceOpen Research
spelling	doaj.art-70aef44adee341afb79c5ce59210a6f3 | 2023-02-23T10:21:15Z | eng | ScienceOpen | ScienceOpen Research | 2199-1006 | 2015-08-01 | 10.14293/S2199-1006.1.SOR-LIFE.A67837.v2 | FASTA Herder: a web application to trim protein sequence sets | Miguel Andrade; Caroline Louis-Jeune; Carol Perez-Iratxeta | https://www.scienceopen.com/document?vid=5df5dc75-0b14-497d-804d-0075d0201d15
title	FASTA Herder: a web application to trim protein sequence sets
url	https://www.scienceopen.com/document?vid=5df5dc75-0b14-497d-804d-0075d0201d15

Similar Items