Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of <it>cis</it>-regulatory modules

Abstract Background <it>cis</it>-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate...

Full description

Bibliographic Details
Main Authors:	Roytberg Mikhail A, Régnier Mireille, Clément Julien, Boeva Valentina, Makeev Vsevolod J
Format:	Article
Language:	English
Published:	BMC 2007-10-01
Series:	Algorithms for Molecular Biology
Online Access:	http://www.almob.org/content/2/1/13

_version_	1818472291406708736
author	Roytberg Mikhail A Régnier Mireille Clément Julien Boeva Valentina Makeev Vsevolod J
author_facet	Roytberg Mikhail A Régnier Mireille Clément Julien Boeva Valentina Makeev Vsevolod J
author_sort	Roytberg Mikhail A
collection	DOAJ
description	<p>Abstract</p> <p>Background</p> <p><it>cis</it>-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of <it>p</it>-values for simultaneous occurrences of different motifs which can overlap.</p> <p>Results</p> <p>We developed and implemented an algorithm computing the <it>p</it>-value that <it>s </it>different motifs occur respectively <it>k</it><sub>1</sub>, ..., <it>k</it><sub><it>s </it></sub>or more times, possibly overlapping, in a random text. Motifs can be represented with a majority of popular motif models, but in all cases, without indels. Zero or first order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of <it>cis</it>-regulatory modules involved in <it>D. melanogaster </it>early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA.</p> <p>Method</p> <p>The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs with the <it>O</it>(<it>n</it>\|Σ\|(<it>m</it>\|<inline-formula><m:math name="1748-7188-2-13-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>ℋ</m:mi><m:annotation encoding="MathType-MTEF"> MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFlecsaaa@3762@</m:annotation></m:semantics></m:math></inline-formula>\| + <it>K</it>\|<it>σ</it>\|<sup><it>K</it></sup>) ∏<sub><it>i </it></sub><it>k</it><sub><it>i</it></sub>) time complexity, where <it>n </it>is the length of the text, \|Σ\| is the alphabet size, <it>m </it>is the maximal motif length, \|<inline-formula><m:math name="1748-7188-2-13-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>ℋ</m:mi><m:annotation encoding="MathType-MTEF"> MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFlecsaaa@3762@</m:annotation></m:semantics></m:math></inline-formula>\| is the total number of words in motifs, <it>K </it>is the order of Markov model, and <it>k</it><sub><it>i </it></sub>is the number of occurrences of the <it>i</it>th motif.</p> <p>Conclusion</p> <p>The primary objective of the program is to assess the likelihood that a given DNA segment is CRM regulated with a known set of regulatory factors. In addition, the program can also be used to select the appropriate threshold for PWM scanning. Another application is assessing similarity of different motifs.</p> <p>Availability</p> <p>Project web page, stand-alone version and documentation can be found at <url>http://bioinform.genetika.ru/AhoPro/</url></p>
first_indexed	2024-04-14T04:05:02Z
format	Article
id	doaj.art-900edf70c7c0460faf20bb434454edbc
institution	Directory Open Access Journal
issn	1748-7188
language	English
last_indexed	2024-04-14T04:05:02Z
publishDate	2007-10-01
publisher	BMC
record_format	Article
series	Algorithms for Molecular Biology
spelling	doaj.art-900edf70c7c0460faf20bb434454edbc2022-12-22T02:13:24ZengBMCAlgorithms for Molecular Biology1748-71882007-10-01211310.1186/1748-7188-2-13Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of <it>cis</it>-regulatory modulesRoytberg Mikhail ARégnier MireilleClément JulienBoeva ValentinaMakeev Vsevolod J<p>Abstract</p> <p>Background</p> <p><it>cis</it>-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of <it>p</it>-values for simultaneous occurrences of different motifs which can overlap.</p> <p>Results</p> <p>We developed and implemented an algorithm computing the <it>p</it>-value that <it>s </it>different motifs occur respectively <it>k</it><sub>1</sub>, ..., <it>k</it><sub><it>s </it></sub>or more times, possibly overlapping, in a random text. Motifs can be represented with a majority of popular motif models, but in all cases, without indels. Zero or first order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of <it>cis</it>-regulatory modules involved in <it>D. melanogaster </it>early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA.</p> <p>Method</p> <p>The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs with the <it>O</it>(<it>n</it>\|Σ\|(<it>m</it>\|<inline-formula><m:math name="1748-7188-2-13-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>ℋ</m:mi><m:annotation encoding="MathType-MTEF"> MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFlecsaaa@3762@</m:annotation></m:semantics></m:math></inline-formula>\| + <it>K</it>\|<it>σ</it>\|<sup><it>K</it></sup>) ∏<sub><it>i </it></sub><it>k</it><sub><it>i</it></sub>) time complexity, where <it>n </it>is the length of the text, \|Σ\| is the alphabet size, <it>m </it>is the maximal motif length, \|<inline-formula><m:math name="1748-7188-2-13-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>ℋ</m:mi><m:annotation encoding="MathType-MTEF"> MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaat0uy0HwzTfgDPnwy1egaryqtHrhAL1wy0L2yHvdaiqaacqWFlecsaaa@3762@</m:annotation></m:semantics></m:math></inline-formula>\| is the total number of words in motifs, <it>K </it>is the order of Markov model, and <it>k</it><sub><it>i </it></sub>is the number of occurrences of the <it>i</it>th motif.</p> <p>Conclusion</p> <p>The primary objective of the program is to assess the likelihood that a given DNA segment is CRM regulated with a known set of regulatory factors. In addition, the program can also be used to select the appropriate threshold for PWM scanning. Another application is assessing similarity of different motifs.</p> <p>Availability</p> <p>Project web page, stand-alone version and documentation can be found at <url>http://bioinform.genetika.ru/AhoPro/</url></p>http://www.almob.org/content/2/1/13
spellingShingle	Roytberg Mikhail A Régnier Mireille Clément Julien Boeva Valentina Makeev Vsevolod J Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of <it>cis</it>-regulatory modules Algorithms for Molecular Biology
title	Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of <it>cis</it>-regulatory modules
title_full	Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of <it>cis</it>-regulatory modules
title_fullStr	Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of <it>cis</it>-regulatory modules
title_full_unstemmed	Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of <it>cis</it>-regulatory modules
title_short	Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of <it>cis</it>-regulatory modules
title_sort	exact p value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of it cis it regulatory modules
url	http://www.almob.org/content/2/1/13
work_keys_str_mv	AT roytbergmikhaila exactpvaluecalculationforheterotypicclustersofregulatorymotifsanditsapplicationincomputationalannotationofitcisitregulatorymodules AT regniermireille exactpvaluecalculationforheterotypicclustersofregulatorymotifsanditsapplicationincomputationalannotationofitcisitregulatorymodules AT clementjulien exactpvaluecalculationforheterotypicclustersofregulatorymotifsanditsapplicationincomputationalannotationofitcisitregulatorymodules AT boevavalentina exactpvaluecalculationforheterotypicclustersofregulatorymotifsanditsapplicationincomputationalannotationofitcisitregulatorymodules AT makeevvsevolodj exactpvaluecalculationforheterotypicclustersofregulatorymotifsanditsapplicationincomputationalannotationofitcisitregulatorymodules

Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of <it>cis</it>-regulatory modules

Similar Items