On counting the frequency distribution of string motifs in molecular sequences

Bibliographic Details
Main Authors: Prosperi, M; Prosperi, L; Gray, R; Salemi, M
Format: Journal article
Language: English
Published: 2012
Description: This work investigates frequency distributions of strings within a text. The mathematical derivation accounts for variable alphabet size, character probabilities, and string/text lengths, under both the Bernoullian and the Markovian models for string generation. The analysis is limited to the set of non-clumpable strings, i.e. strings that cannot overlap with themselves. Two formulae (exact and approximated) are derived for calculating the frequency distribution of a string of length m found inside a text of length n (with m < n). The approximated formula has constant complexity (in contrast to the exponential complexity of the exact formula), which makes it applicable to very long texts. The proposed formulae were applied to analyze string frequencies in a portion of the human genome, and to recalculate frequencies of known repeated motifs within genes that are associated with genetic diseases. A comparison with state-of-the-art methods was provided. The formulae presented here can be of use in the statistical evaluation of specific motif frequencies within very long texts (e.g. genes or genomes) and can help in characterizing motifs in pathologic conditions. © 2012 World Scientific Publishing Company.
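The terms used in the description can be made concrete with a minimal Python sketch. It does not reproduce the authors' exact or approximated formulae, which this record does not state. It only shows, under stated assumptions, what "non-clumpable" means and what the expected number of occurrences is under a Bernoulli (independent characters) model, using the standard result (n - m + 1) times the product of the character probabilities. The function names `is_non_clumpable` and `expected_count` are hypothetical and chosen for illustration.

```python
from math import prod
from typing import Mapping


def is_non_clumpable(s: str) -> bool:
    """Return True if s cannot overlap with itself.

    A string overlaps with itself when some proper prefix equals
    the suffix of the same length (e.g. "ACA": prefix "A" == suffix "A").
    """
    return not any(s[:k] == s[-k:] for k in range(1, len(s)))


def expected_count(s: str, n: int, probs: Mapping[str, float]) -> float:
    """Expected number of occurrences of s in a random text of length n.

    Assumes a Bernoulli model: characters are drawn independently with
    the probabilities in `probs`. There are n - m + 1 start positions,
    and s matches at each with probability prod(probs[c] for c in s).
    """
    m = len(s)
    if m > n:
        return 0.0
    return (n - m + 1) * prod(probs[c] for c in s)


if __name__ == "__main__":
    uniform_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(is_non_clumpable("ACG"))   # no prefix equals a suffix
    print(is_non_clumpable("ACA"))   # "A" is both prefix and suffix
    print(expected_count("ACG", 1000, uniform_dna))
```

The expectation alone does not give the full frequency distribution that the paper derives; restricting attention to non-clumpable strings matters there because occurrences of such strings cannot overlap within the text.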
ID: oxford-uuid:dcf51b6e-e7ee-425d-80ad-eb6768981622
Institution: University of Oxford