Constructing benchmark test sets for biological sequence analysis using independent set algorithms.

Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient beca...

Bibliographic Details
Main Authors: Samantha Petti, Sean R Eddy
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2022-03-01
Series: PLoS Computational Biology
Online Access: https://doi.org/10.1371/journal.pcbi.1009492
author Samantha Petti
Sean R Eddy
collection DOAJ
description Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
first_indexed 2024-04-10T04:33:35Z
format Article
id doaj.art-18c1a0210b59456abeb5f34c4f02fe4f
institution Directory of Open Access Journals
issn 1553-734X
1553-7358
language English
last_indexed 2024-04-10T04:33:35Z
publishDate 2022-03-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS Computational Biology
spelling doaj.art-18c1a0210b59456abeb5f34c4f02fe4f | 2023-03-10T05:31:33Z | eng | Public Library of Science (PLoS) | PLoS Computational Biology | 1553-734X | 1553-7358 | 2022-03-01 | 18 | 3 | e1009492 | 10.1371/journal.pcbi.1009492 | Constructing benchmark test sets for biological sequence analysis using independent set algorithms. | Samantha Petti | Sean R Eddy | https://doi.org/10.1371/journal.pcbi.1009492
title Constructing benchmark test sets for biological sequence analysis using independent set algorithms.
url https://doi.org/10.1371/journal.pcbi.1009492
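
Illustrative sketch (not part of the catalog record): the description states the goal as a split in which each test sequence is less than p% identical to any individual training sequence. The Python below is a minimal sketch of one simple randomized way to satisfy that condition. It is not the authors' algorithms, which the record does not detail. The function names percent_identity and split_dissimilar, the seed parameter, and the toy identity measure (fraction of matching columns between pre-aligned, equal-length strings) are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def percent_identity(a: str, b: str) -> float:
    """Toy measure (assumption): percent of matching columns in two
    pre-aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def split_dissimilar(
    seqs: Sequence[str], p: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train, test) index lists such that every test sequence is
    less than p% identical to every training sequence.

    Hypothetical greedy procedure: visit sequences in random order and
    place each one on a side where it does not violate the threshold;
    discard it if it is too similar to both sides.
    """
    rng = random.Random(seed)
    order = list(range(len(seqs)))
    rng.shuffle(order)

    train: List[int] = []
    test: List[int] = []
    for i in order:
        near_train = any(percent_identity(seqs[i], seqs[j]) >= p for j in train)
        near_test = any(percent_identity(seqs[i], seqs[j]) >= p for j in test)
        if near_train and near_test:
            continue  # would break the condition on either side, so drop it
        if near_train:
            train.append(i)  # similar only to training sequences
        elif near_test:
            test.append(i)  # similar only to test sequences
        else:
            # Similar to nothing placed so far: choose a side at random.
            (train if rng.random() < 0.5 else test).append(i)
    return train, test


if __name__ == "__main__":
    family = ["AAAA", "AAAT", "CCCC", "CCCG", "GGGG"]
    train_idx, test_idx = split_dissimilar(family, p=50.0, seed=1)
    print("train:", [family[i] for i in train_idx])
    print("test:", [family[i] for i in test_idx])
```

A sequence that is too similar to both sides is discarded, so this sketch can shrink the dataset, and depending on the random order it may leave one side empty. A practical tool would also need a real alignment-based identity measure and some control over the relative sizes of the two sets.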