Alignment-free sequence comparison (I): statistics and power.

Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D(2) statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D(2) statistic is not suita...
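As a rough illustration of the k-tuple comparison that the abstract describes, the plain D(2) statistic can be computed as the sum, over all words of length k, of the product of that word's counts in the two sequences. The sketch below is a minimal reading of that definition and is not code from the article. The function names kmer_counts and d2 are hypothetical, and the sketch does not implement the adjusted variants the article proposes.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-tuples (words of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Plain D2: sum over words w of count_a(w) * count_b(w).

    Hypothetical helper for illustration; no centering or
    standardization of the counts is applied here.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Only words present in both sequences contribute a nonzero product.
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)


if __name__ == "__main__":
    print(d2("ACGT", "ACGA", 2))
```

Because the raw counts are multiplied without centering, a word that is simply frequent in each sequence separately inflates the sum, which is one way to picture the single-sequence noise mentioned in the abstract.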


Bibliographic Details
Main Authors:	Reinert, G, Chew, D, Sun, F, Waterman, MS
Format:	Journal article
Language:	English
Published:	2009

collection	OXFORD
description	Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the D(2) statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the D(2) statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the D(2) word count statistic, which we call D(2)(S) and D(2)(*). For D(2)(S), which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, D(2)(*), outperforms D(2)(S) in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of D(2)(*), we cannot provide a closed form for power calculations.
first_indexed	2024-03-07T01:24:56Z
format	Journal article
id	oxford-uuid:91a6d8da-6a25-4620-9472-c7871f3f492e
institution	University of Oxford
language	English
last_indexed	2024-03-07T01:24:56Z
publishDate	2009
record_format	dspace


Similar Items