FASTA Herder: a web application to trim protein sequence sets

Abstract: Because of the ever-increasing number of sequences in protein databases, sequence similarity searches usually return large numbers of homologs. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs usually have high...

Bibliographic Details
Main Authors: Miguel Andrade, Caroline Louis-Jeune, Carol Perez-Iratxeta
Format: Article
Language: English
Published: ScienceOpen 2015-08-01
Series: ScienceOpen Research
Online Access: https://www.scienceopen.com/document?vid=5df5dc75-0b14-497d-804d-0075d0201d15
collection DOAJ
description Abstract: Because of the ever-increasing number of sequences in protein databases, sequence similarity searches usually return large numbers of homologs. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs have high levels of identity among themselves, which hinders the computation and, especially, the visualization of multiple sequence alignments. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled to produce a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near-full-length homologs, allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://fh.ogic.ca/.
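As an illustration of the kind of culling the abstract describes, the following is a minimal Python sketch under stated assumptions; it is not the authors' algorithm and not FASTA Herder's code. It assumes a greedy strategy (longest sequences become cluster representatives), uses `difflib.SequenceMatcher` as a crude stand-in for alignment-based sequence identity, and treats two sequences as "near-full-length" homologs when the shorter one covers at least `min_coverage` of the longer one. The function names `parse_fasta`, `similarity` and `cull`, and both default thresholds, are hypothetical.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def similarity(a, b):
    """Crude stand-in for sequence identity (assumption: no real alignment)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cull(records, min_identity=0.6, min_coverage=0.9):
    """Greedily keep one representative per group of similar sequences.

    A sequence is dropped when an already kept, longer-or-equal sequence
    is both of comparable length (min_coverage) and similar (min_identity).
    """
    kept = []
    # Ignore empty sequences, then visit the longest sequences first.
    candidates = [r for r in records if r[1]]
    for header, seq in sorted(candidates, key=lambda r: len(r[1]), reverse=True):
        redundant = any(
            len(seq) / len(rep) >= min_coverage
            and similarity(seq, rep) >= min_identity
            for _, rep in kept
        )
        if not redundant:
            kept.append((header, seq))
    return kept
```

Calling `cull(parse_fasta(text))` returns the retained records, which can then be written back out as FASTA. Lowering `min_identity` merges more distant homologs, while `min_coverage` keeps short fragments from being absorbed by full-length sequences.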
first_indexed 2024-04-10T07:49:54Z
id doaj.art-70aef44adee341afb79c5ce59210a6f3
institution Directory Open Access Journal
issn 2199-1006
last_indexed 2024-04-10T07:49:54Z
spelling doaj.art-70aef44adee341afb79c5ce59210a6f3 2023-02-23T10:21:15Z eng ScienceOpen ScienceOpen Research 2199-1006 2015-08-01 10.14293/S2199-1006.1.SOR-LIFE.A67837.v2