Constructing benchmark test sets for biological sequence analysis using independent set algorithms.
Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient beca...
Main Authors: | Samantha Petti, Sean R Eddy |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2022-03-01 |
Series: | PLoS Computational Biology |
Online Access: | https://doi.org/10.1371/journal.pcbi.1009492 |
_version_ | 1811155398990757888 |
---|---|
author | Samantha Petti Sean R Eddy |
collection | DOAJ |
description | Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets. |
first_indexed | 2024-04-10T04:33:35Z |
format | Article |
id | doaj.art-18c1a0210b59456abeb5f34c4f02fe4f |
institution | Directory Open Access Journal |
issn | 1553-734X 1553-7358 |
language | English |
last_indexed | 2024-04-10T04:33:35Z |
publishDate | 2022-03-01 |
publisher | Public Library of Science (PLoS) |
record_format | Article |
series | PLoS Computational Biology |
spelling | doaj.art-18c1a0210b59456abeb5f34c4f02fe4f; 2023-03-10T05:31:33Z; eng; Public Library of Science (PLoS); PLoS Computational Biology; ISSN 1553-734X, 1553-7358; 2022-03-01; vol. 18, no. 3, e1009492; doi:10.1371/journal.pcbi.1009492; Constructing benchmark test sets for biological sequence analysis using independent set algorithms; Samantha Petti, Sean R Eddy; https://doi.org/10.1371/journal.pcbi.1009492 |
title | Constructing benchmark test sets for biological sequence analysis using independent set algorithms. |
url | https://doi.org/10.1371/journal.pcbi.1009492 |
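
The splitting criterion stated in the description (each test sequence is less than p% identical to every training sequence) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it is a simple randomized greedy assignment on a similarity graph, written only to show the constraint being enforced. The function names `percent_identity` and `split_family`, the identity definition (matches over columns where both pre-aligned sequences have a residue), and the toy sequences are all assumptions made for illustration.

```python
import random
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where both aligned sequences have a residue.

    Assumption: `a` and `b` are rows of the same alignment, with '-' as the gap symbol.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def split_family(seqs, p, seed=0):
    """Greedy split so that no train/test pair is >= p percent identical.

    Illustrative only: sequences that cannot be placed in either set are discarded.
    """
    rng = random.Random(seed)
    n = len(seqs)

    # Similarity graph: an edge joins two sequences that are too similar.
    neighbors = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if percent_identity(seqs[i], seqs[j]) >= p:
            neighbors[i].add(j)
            neighbors[j].add(i)

    order = list(range(n))
    rng.shuffle(order)

    train, test = set(), set()
    for v in order:
        can_train = not (neighbors[v] & test)   # no similar sequence already in test
        can_test = not (neighbors[v] & train)   # no similar sequence already in train
        if can_train and can_test:
            (train if rng.random() < 0.5 else test).add(v)
        elif can_train:
            train.add(v)
        elif can_test:
            test.add(v)
        # Otherwise v is similar to members of both sets and is dropped.
    return sorted(train), sorted(test)


# Hypothetical toy family of pre-aligned sequences.
family = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY", "MNPQRSTVWA", "ACDEYGHIKL"]
train, test = split_family(family, p=50)
```

Because a sequence joins a set only when none of its neighbors is in the other set, the returned index lists never contain a train/test pair at or above the threshold, although either set may end up small or empty for a tightly connected family.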