FASTA Herder: a web application to trim protein sequence sets
Abstract: The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. While homology information can be very useful for functional prediction based on amino acid conservation, many of these homologs usually have high...
Main Authors: | Miguel Andrade, Caroline Louis-Jeune, Carol Perez-Iratxeta |
---|---|
Format: | Article |
Language: | English |
Published: | ScienceOpen, 2015-08-01 |
Series: | ScienceOpen Research |
Online Access: | https://www.scienceopen.com/document?vid=5df5dc75-0b14-497d-804d-0075d0201d15 |
_version_ | 1797896885973811200 |
---|---|
author | Miguel Andrade, Caroline Louis-Jeune, Carol Perez-Iratxeta |
collection | DOAJ |
description | Abstract: The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. While homology information can be very useful for functional prediction based on amino acid conservation, many of these homologs share high levels of identity among themselves, which hinders multiple sequence alignment computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled, producing a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near-full-length homologs, allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://fh.ogic.ca/. |
first_indexed | 2024-04-10T07:49:54Z |
format | Article |
id | doaj.art-70aef44adee341afb79c5ce59210a6f3 |
institution | Directory of Open Access Journals |
issn | 2199-1006 |
language | English |
last_indexed | 2024-04-10T07:49:54Z |
publishDate | 2015-08-01 |
publisher | ScienceOpen |
record_format | Article |
series | ScienceOpen Research |
spelling | doaj.art-70aef44adee341afb79c5ce59210a6f3; 2023-02-23T10:21:15Z; eng; ScienceOpen; ScienceOpen Research; 2199-1006; 2015-08-01; 10.14293/S2199-1006.1.SOR-LIFE.A67837.v2; FASTA Herder: a web application to trim protein sequence sets; Miguel Andrade; Caroline Louis-Jeune; Carol Perez-Iratxeta; https://www.scienceopen.com/document?vid=5df5dc75-0b14-497d-804d-0075d0201d15 |
title | FASTA Herder: a web application to trim protein sequence sets |
url | https://www.scienceopen.com/document?vid=5df5dc75-0b14-497d-804d-0075d0201d15 |
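
The abstract describes culling redundant homologs so that only near-full-length sequences above a chosen identity threshold are merged. The following is a minimal illustrative sketch of that general idea, not the FASTA Herder algorithm itself, whose details are not given in this record. Everything in it is an assumption made for illustration: the function names, the greedy longest-first strategy, the default thresholds, and the use of Python's `difflib` matching blocks as a rough stand-in for alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_redundant(seq_a, seq_b, min_identity, min_coverage):
    """Hypothetical redundancy test: near-full-length AND sufficiently identical."""
    if not seq_a or not seq_b:
        return False
    longer = max(len(seq_a), len(seq_b))
    # Length coverage: a short fragment is never merged with a full-length homolog.
    if min(len(seq_a), len(seq_b)) / longer < min_coverage:
        return False
    # Rough identity estimate (assumption): matched residues over the longer length.
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / longer >= min_identity


def cull_redundant(records, min_identity=0.6, min_coverage=0.9):
    """Greedy culling: visit longest sequences first, keep one per cluster."""
    representatives = []
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if not any(
            is_redundant(seq, rep_seq, min_identity, min_coverage)
            for _, rep_seq in representatives
        ):
            representatives.append((header, seq))
    return representatives


def to_fasta(records):
    """Serialize (header, sequence) tuples back to FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Calling `to_fasta(cull_redundant(parse_fasta(text)))` on the contents of a FASTA file would return a trimmed FASTA string under these assumptions. A real tool would compute identity from proper pairwise alignments rather than `difflib`, and the thresholds here are placeholders, not values taken from the paper.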