DeepTSS: multi-branch convolutional neural network for transcription start site identification from CAGE data

Abstract Background The widespread usage of Cap Analysis of Gene Expression (CAGE) has led to numerous breakthroughs in understanding the transcription mechanisms. Recent evidence in the literature, however, suggests that CAGE suffers from transcriptional and technical noise. Regardless of the sampl...

Bibliographic Details
Main Authors: Dimitris Grigoriadis, Nikos Perdikopanis, Georgios K. Georgakilas, Artemis G. Hatzigeorgiou
Format: Article
Language: English
Published: BMC 2022-12-01
Series: BMC Bioinformatics
Subjects: TSS, CAGE, Bioinformatics, Promoter, Transcription, Machine Learning
Online Access: https://doi.org/10.1186/s12859-022-04945-y
collection DOAJ
description Abstract
Background
The widespread usage of Cap Analysis of Gene Expression (CAGE) has led to numerous breakthroughs in understanding the transcription mechanisms. Recent evidence in the literature, however, suggests that CAGE suffers from transcriptional and technical noise. Regardless of the sample quality, there is a significant number of CAGE peaks that are not associated with transcription initiation events. This type of signal is typically attributed to technical noise and more frequently to random five-prime capping or transcription bioproducts. Thus, the need for computational methods emerges, that can accurately increase the signal-to-noise ratio in CAGE data, resulting in error-free transcription start site (TSS) annotation and quantification of regulatory region usage. In this study, we present DeepTSS, a novel computational method for processing CAGE samples, that combines genomic signal processing (GSP), structural DNA features, evolutionary conservation evidence and raw DNA sequence with Deep Learning (DL) to provide single-nucleotide TSS predictions with unprecedented levels of performance.
Results
To evaluate DeepTSS, we utilized experimental data, protein-coding gene annotations and computationally-derived genome segmentations by chromatin states. DeepTSS was found to outperform existing algorithms on all benchmarks, achieving 98% precision and 96% sensitivity (accuracy 95.4%) on the protein-coding gene strategy, with 96.66% of its positive predictions overlapping active chromatin, 98.27% and 92.04% co-localized with at least one transcription factor and H3K4me3 peak.
Conclusions
CAGE is a key protocol in deciphering the language of transcription, however, as every experimental protocol, it suffers from biological and technical noise that can severely affect downstream analyses. DeepTSS is a novel DL-based method for effectively removing noisy CAGE signal. In contrast to existing software, DeepTSS does not require feature selection since the embedded convolutional layers can readily identify patterns and only utilize the important ones for the classification task. This study highlights the key role that DL can play in Molecular Biology, by removing the inherent flaws of experimental protocols, that form the backbone of contemporary research. Here, we show how DeepTSS can unleash the full potential of an already popular and mature method such as CAGE, and push the boundaries of coding and non-coding gene expression regulator research even further.
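Note on the reported metrics: the precision, sensitivity and accuracy figures quoted in the abstract are standard binary-classification measures. The record does not describe how the authors computed them, so the following is only a minimal sketch of the usual definitions. The `ConfusionCounts` container and the counts themselves are hypothetical and chosen for illustration; they are not data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for a binary TSS / non-TSS benchmark outcome."""
    tp: int  # CAGE peaks predicted as TSS that are true TSSs
    fp: int  # peaks predicted as TSS that are noise
    tn: int  # peaks predicted as noise that are noise
    fn: int  # peaks predicted as noise that are true TSSs


def precision(c: ConfusionCounts) -> float:
    """Fraction of positive predictions that are correct."""
    return c.tp / (c.tp + c.fp)


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of true TSSs that are recovered (also called recall)."""
    return c.tp / (c.tp + c.fn)


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all predictions, positive or negative, that are correct."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


# Made-up counts, for illustration only (not taken from the DeepTSS paper).
counts = ConfusionCounts(tp=90, fp=10, tn=80, fn=20)

if __name__ == "__main__":
    print(f"precision   = {precision(counts):.3f}")
    print(f"sensitivity = {sensitivity(counts):.3f}")
    print(f"accuracy    = {accuracy(counts):.3f}")
```

Precision and sensitivity depend only on the positive class, whereas accuracy also counts true negatives, which is why an accuracy value can fall below both of the other two figures.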
first_indexed 2024-04-12T01:02:17Z
id doaj.art-79def2d984ad4e3c8a8be5e9ca87eebf
institution Directory Open Access Journal
issn 1471-2105
spelling doaj.art-79def2d984ad4e3c8a8be5e9ca87eebf; 2022-12-22T03:54:25Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2022-12-01; volume 23, issue S2, pages 1-17; 10.1186/s12859-022-04945-y
Dimitris Grigoriadis (Hellenic Pasteur Institute)
Nikos Perdikopanis (Hellenic Pasteur Institute)
Georgios K. Georgakilas (Department of Electrical and Computer Engineering, University of Thessaly)
Artemis G. Hatzigeorgiou (Hellenic Pasteur Institute)
topic TSS
CAGE
Bioinformatics
Promoter
Transcription
Machine Learning