Prediction accuracy of regulatory elements from sequence varies by functional sequencing technique

Introduction: Various sequencing based approaches are used to identify and characterize the activities of cis-regulatory elements in a genome-wide fashion. Some of these techniques rely on indirect markers such as histone modifications (ChIP-seq with histone antibodies) or chromatin accessibility (ATA...

Bibliographic Details
Main Authors: Ronald J. Nowling, Kimani Njoya, John G. Peters, Michelle M. Riehle
Format: Article
Language: English
Published: Frontiers Media S.A. 2023-08-01
Series: Frontiers in Cellular and Infection Microbiology
Subjects: enhancers, functional sequencing, machine learning, sequence models, DNase-seq, STARR-seq
Online Access: https://www.frontiersin.org/articles/10.3389/fcimb.2023.1182567/full
author Ronald J. Nowling
Kimani Njoya
John G. Peters
Michelle M. Riehle
collection DOAJ
description Introduction: Various sequencing based approaches are used to identify and characterize the activities of cis-regulatory elements in a genome-wide fashion. Some of these techniques rely on indirect markers such as histone modifications (ChIP-seq with histone antibodies) or chromatin accessibility (ATAC-seq, DNase-seq, FAIRE-seq), while other techniques use direct measures such as episomal assays measuring the enhancer properties of DNA sequences (STARR-seq) and direct measurement of the binding of transcription factors (ChIP-seq with transcription factor-specific antibodies). The activities of cis-regulatory elements such as enhancers, promoters, and repressors are determined by their sequence and secondary processes such as chromatin accessibility, DNA methylation, and bound histone markers.
Methods: Here, machine learning models are employed to evaluate the accuracy with which cis-regulatory elements identified by various commonly used sequencing techniques can be predicted by their underlying sequence alone to distinguish between cis-regulatory activity that is reflective of sequence content versus secondary processes.
Results and discussion: Models trained and evaluated on D. melanogaster sequences identified through DNase-seq and STARR-seq are significantly more accurate than models trained on sequences identified by H3K4me1, H3K4me3, and H3K27ac ChIP-seq, FAIRE-seq, and ATAC-seq. These results suggest that the activity detected by DNase-seq and STARR-seq can be largely explained by underlying DNA sequence, independent of secondary processes. Experimentally, a subset of DNase-seq and H3K4me1 ChIP-seq sequences were tested for enhancer activity using luciferase assays and compared with previous tests performed on STARR-seq sequences. The experimental data indicated that STARR-seq sequences are substantially enriched for enhancer-specific activity, while the DNase-seq and H3K4me1 ChIP-seq sequences are not. Taken together, these results indicate that the DNase-seq approach identifies a broad class of regulatory elements of which enhancers are a subset and the associated data are appropriate for training models for detecting regulatory activity from sequence alone, STARR-seq data are best for training enhancer-specific sequence models, and H3K4me1 ChIP-seq data are not well suited for training and evaluating sequence-based models for cis-regulatory element prediction.
first_indexed 2024-03-12T18:00:12Z
format Article
id doaj.art-3016afc463ad4fa3ba1dc6222152a5fb
institution Directory Open Access Journal
issn 2235-2988
language English
last_indexed 2024-03-12T18:00:12Z
publishDate 2023-08-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Cellular and Infection Microbiology
spelling doaj.art-3016afc463ad4fa3ba1dc6222152a5fb
2023-08-02T11:58:19Z
eng
Frontiers Media S.A.
Frontiers in Cellular and Infection Microbiology
2235-2988
2023-08-01
13
10.3389/fcimb.2023.1182567
1182567
Prediction accuracy of regulatory elements from sequence varies by functional sequencing technique
Ronald J. Nowling (Electrical Engineering and Computer Science, Milwaukee School of Engineering, Milwaukee, WI, United States)
Kimani Njoya (Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States)
John G. Peters (Electrical Engineering and Computer Science, Milwaukee School of Engineering, Milwaukee, WI, United States)
Michelle M. Riehle (Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States)
title Prediction accuracy of regulatory elements from sequence varies by functional sequencing technique
topic enhancers
functional sequencing
machine learning
sequence models
DNase-seq
STARR-seq
url https://www.frontiersin.org/articles/10.3389/fcimb.2023.1182567/full