Constructing benchmark test sets for biological sequence analysis using independent set algorithms.

Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences.

Bibliographic Details
Main Authors:	Samantha Petti, Sean R Eddy
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2022-03-01
Series:	PLoS Computational Biology
Online Access:	https://doi.org/10.1371/journal.pcbi.1009492

Citation:	PLoS Computational Biology 18(3): e1009492
ISSN:	1553-734X, 1553-7358
Source:	Directory of Open Access Journals (DOAJ)
Record ID:	doaj.art-18c1a0210b59456abeb5f34c4f02fe4f
Description:	Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
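
The property stated in the description (every test sequence less than p% identical to every training sequence) can be illustrated with a small sketch. This is not either of the two algorithms from the article, whose details this record does not give. It is a generic greedy heuristic, and several things in it are assumptions made only for illustration: the function names `percent_identity` and `greedy_split`, the `train_prob` parameter, the simplified identity measure over pre-aligned equal-length strings, and the choice to discard sequences that fit neither set.

```python
import random
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent of identical columns between two aligned sequences.

    Simplifying assumption: both strings are already aligned to the
    same nonzero length. Real tools define pairwise identity in
    several different ways.
    """
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def greedy_split(seqs, p, train_prob=0.7, seed=0):
    """Greedily assign sequence indices to train, test, or discarded.

    Hypothetical heuristic: an edge joins two sequences that are at
    least p% identical, and no edge may connect train to test.
    """
    rng = random.Random(seed)
    n = len(seqs)

    # Build the similarity graph: neighbors[i] holds every j that is
    # too similar to i to sit on the opposite side of the split.
    neighbors = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if percent_identity(seqs[i], seqs[j]) >= p:
            neighbors[i].add(j)
            neighbors[j].add(i)

    train, test, discarded = set(), set(), set()
    order = list(range(n))
    rng.shuffle(order)
    for v in order:
        can_train = neighbors[v].isdisjoint(test)
        can_test = neighbors[v].isdisjoint(train)
        if can_train and can_test:
            # No constraint yet: choose a side at random.
            (train if rng.random() < train_prob else test).add(v)
        elif can_train:
            train.add(v)
        elif can_test:
            test.add(v)
        else:
            # Similar to members of both sets: cannot be placed.
            discarded.add(v)
    return train, test, discarded
```

Whatever the random order, a sequence enters one set only when none of its neighbors is in the other, so no training and test pair ever reaches p% identity. The price is that some sequences may be discarded, and nothing guarantees the test set is non-empty, which is one way a family can fail to split.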

Similar Items