Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs

Abstract. Background: Researchers seeking to unlock the genetic basis of human physiology and diseases have been studying gene transcription regulation. The temporal and spatial patterns of gene expression are controlled by mainly non-coding elements know...


Bibliographic Details
Main Authors: Girgis Hani Z, Ovcharenko Ivan
Format: Article
Language: English
Published: BMC 2012-02-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/13/25
_version_ 1819031660442681344
author Girgis Hani Z
Ovcharenko Ivan
collection DOAJ
description Abstract

Background: Researchers seeking to unlock the genetic basis of human physiology and diseases have been studying gene transcription regulation. The temporal and spatial patterns of gene expression are controlled by mainly non-coding elements known as cis-regulatory modules (CRMs) and epigenetic factors. CRMs modulating related genes share the regulatory signature which consists of transcription factor (TF) binding sites (TFBSs). Identifying such CRMs is a challenging problem due to the prohibitive number of sequence sets that need to be analyzed.

Results: We formulated the challenge as a supervised classification problem even though experimentally validated CRMs were not required. Our efforts resulted in a software system named CrmMiner. The system mines for CRMs in the vicinity of related genes. CrmMiner requires two sets of sequences: a mixed set and a control set. Sequences in the vicinity of the related genes comprise the mixed set, whereas the control set includes random genomic sequences. CrmMiner assumes that a large percentage of the mixed set is made of background sequences that do not include CRMs. The system identifies pairs of closely located motifs representing vertebrate TFBSs that are enriched in the training mixed set consisting of 50% of the gene loci. In addition, CrmMiner selects a group of the enriched pairs to represent the tissue-specific regulatory signature. The mixed and the control sets are searched for candidate sequences that include any of the selected pairs. Next, an optimal Bayesian classifier is used to distinguish candidates found in the mixed set from their control counterparts.

Our study proposes 62 tissue-specific regulatory signatures and putative CRMs for different human tissues and cell types. These signatures consist of assortments of ubiquitously expressed TFs and tissue-specific TFs. Under controlled settings, CrmMiner identified known CRMs in noisy sets up to 1:25 signal-to-noise ratio. CrmMiner was 21-75% more precise than a related CRM predictor. The sensitivity of the system to locate known human heart enhancers reached up to 83%. CrmMiner precision reached 82% while mining for CRMs specific to the human CD4+ T cells. On several data sets, the system achieved 99% specificity.

Conclusion: These results suggest that CrmMiner predictions are accurate and likely to be tissue-specific CRMs. We expect that the predicted tissue-specific CRMs and the regulatory signatures broaden our knowledge of gene transcription regulation.
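The abstract above describes finding pairs of closely located motifs that are enriched in the mixed set relative to the control set. The Python sketch below illustrates that general idea only and is not CrmMiner's implementation. The record does not state which enrichment statistic, distance cutoff, or significance threshold the authors used, so the one-sided hypergeometric test, the `max_gap` and `alpha` parameters, the function names, and the motif labels `M1`/`M2` are all assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def close_pairs(hits, max_gap=50):
    """Return unordered pairs of distinct motifs lying within max_gap bp.

    hits: list of (motif_name, position) tuples for one sequence.
    max_gap is a hypothetical cutoff, not a value taken from the paper.
    """
    pairs = set()
    for (m1, p1), (m2, p2) in combinations(hits, 2):
        if m1 != m2 and abs(p1 - p2) <= max_gap:
            pairs.add(tuple(sorted((m1, m2))))
    return pairs


def hypergeom_tail(k, K, n, N):
    """P(X >= k) for X ~ Hypergeometric(N total, K with the pair, n drawn)."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def enriched_pairs(mixed, control, max_gap=50, alpha=0.05):
    """Return {pair: p_value} for pairs over-represented in the mixed set.

    mixed, control: lists of per-sequence hit lists (see close_pairs).
    The test asks how surprising it is that k of the K sequences carrying
    the pair fall in the mixed set, which holds n of the N sequences.
    """
    mixed_pairs = [close_pairs(h, max_gap) for h in mixed]
    control_pairs = [close_pairs(h, max_gap) for h in control]
    n, N = len(mixed), len(mixed) + len(control)
    result = {}
    for pair in set().union(*mixed_pairs):
        k = sum(pair in s for s in mixed_pairs)
        K = k + sum(pair in s for s in control_pairs)
        p = hypergeom_tail(k, K, n, N)
        if p < alpha:
            result[pair] = p
    return result


if __name__ == "__main__":
    # Toy data: the pair is adjacent in every mixed sequence, far apart in controls.
    mixed = [[("M1", 10), ("M2", 30)]] * 4
    control = [[("M1", 10), ("M2", 500)]] * 4
    print(enriched_pairs(mixed, control))
```

In a fuller pipeline along the lines of the abstract, the selected pairs would then define candidate sequences, and a separate classifier would separate mixed-set candidates from control candidates. That step is not sketched here.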
first_indexed 2024-12-21T06:49:35Z
format Article
id doaj.art-580654a8ee164fb0af8b5ee5a2297718
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-12-21T06:49:35Z
publishDate 2012-02-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-580654a8ee164fb0af8b5ee5a2297718 2022-12-21T19:12:31Z eng BMC BMC Bioinformatics 1471-2105 2012-02-01 13 1 25 10.1186/1471-2105-13-25 Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs Girgis Hani Z Ovcharenko Ivan http://www.biomedcentral.com/1471-2105/13/25
title Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs
url http://www.biomedcentral.com/1471-2105/13/25