iPromoter-Seqvec: identifying promoters using bidirectional long short-term memory and sequence-embedded features

Abstract Background Promoters, non-coding DNA sequences located at upstream regions of the transcription start site of genes/gene clusters, are essential regulatory elements for the initiation and regulation of transcriptional processes. Furthermore, identifying promoters in DNA sequences and genome...


Bibliographic Details
Main Authors: Thanh-Hoang Nguyen-Vo, Quang H. Trinh, Loc Nguyen, Phuong-Uyen Nguyen-Hoang, Susanto Rahardja, Binh P. Nguyen
Format: Article
Language: English
Published: BMC 2022-10-01
Series: BMC Genomics
Subjects: DNA; Transcription start site; Promoter; TATA-box; Bidirectional long short-term memory
Online Access: https://doi.org/10.1186/s12864-022-08829-6
author Thanh-Hoang Nguyen-Vo
Quang H. Trinh
Loc Nguyen
Phuong-Uyen Nguyen-Hoang
Susanto Rahardja
Binh P. Nguyen
author_sort Thanh-Hoang Nguyen-Vo
collection DOAJ
description Abstract
Background: Promoters, non-coding DNA sequences located upstream of the transcription start sites of genes or gene clusters, are essential regulatory elements for the initiation and regulation of transcription. Identifying promoters in DNA sequences and genomes also contributes significantly to resolving the complete structure of genes of interest. Exploration of promoter regions is therefore one of the most important topics in molecular genetics and biology. Besides experimental techniques, computational methods have been developed to predict promoters. In this study, we propose iPromoter-Seqvec, an efficient computational model that predicts TATA and non-TATA promoters in human and mouse genomes using bidirectional long short-term memory neural networks in combination with sequence-embedded features extracted from the input sequences. The promoter and non-promoter sequences were retrieved from the Eukaryotic Promoter Database and were then refined to create four benchmark datasets.
Results: The area under the receiver operating characteristic curve (AUCROC) and the area under the precision-recall curve (AUCPR) were used as the two key metrics to evaluate model performance. Results on independent test sets showed that iPromoter-Seqvec outperformed other state-of-the-art methods, with AUCROC values ranging from 0.85 to 0.99 and AUCPR values ranging from 0.86 to 0.99. Models predicting TATA promoters in both species had slightly higher predictive power than those predicting non-TATA promoters. By constructing artificial non-promoter sequences from promoter sequences, a novel idea, our models were able to learn highly specific characteristics that discriminate promoters from non-promoters, which improved predictive efficiency.
Conclusions: iPromoter-Seqvec is a stable and robust model for predicting both TATA and non-TATA promoters in human and mouse genomes. Our proposed method was also deployed as an online web server with a user-friendly interface to support research communities. Links to our source code and web server are available at https://github.com/mldlproject/2022-iPromoter-Seqvec.
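The two evaluation metrics named in the abstract can be illustrated with a small sketch. This is not the authors' evaluation code: the function names `auc_roc` and `average_precision` are hypothetical, the labels and scores in the usage lines are made-up toy values, and average precision is used here as one common step-wise estimate of the area under the precision-recall curve (the record does not say how the paper computes AUCPR).

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen positive is scored above a
    randomly chosen negative, with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of the precision values at each positive hit.

    Simplification (assumption): scores are treated as distinct, so tied
    scores are ranked in input order rather than grouped.
    """
    total_pos = sum(1 for y in labels if y == 1)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    ap = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / total_pos


if __name__ == "__main__":
    # Toy data: 1 = promoter, 0 = non-promoter (illustrative values only).
    y_true = [1, 0, 1, 1, 0, 0]
    y_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
    print("AUCROC:", auc_roc(y_true, y_score))
    print("AUCPR (average precision):", average_precision(y_true, y_score))
```

Both functions take binary labels and real-valued scores, so they apply to any classifier output; reporting both is useful because AUCPR is more sensitive than AUCROC to how well the positive class is ranked when classes are imbalanced.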
first_indexed 2024-04-12T00:34:50Z
format Article
id doaj.art-08e33e81a2c047b2ab6f7804ec59d920
institution Directory Open Access Journal
issn 1471-2164
language English
last_indexed 2024-04-12T00:34:50Z
publishDate 2022-10-01
publisher BMC
record_format Article
series BMC Genomics
spelling doaj.art-08e33e81a2c047b2ab6f7804ec59d920
Record timestamp: 2022-12-22T03:55:11Z
Language: eng
Publisher: BMC
Journal: BMC Genomics
ISSN: 1471-2164
Published: 2022-10-01
Volume 23, issue S5, pages 1-12
DOI: 10.1186/s12864-022-08829-6
Title: iPromoter-Seqvec: identifying promoters using bidirectional long short-term memory and sequence-embedded features
Authors and affiliations:
Thanh-Hoang Nguyen-Vo (School of Mathematics and Statistics, Victoria University of Wellington)
Quang H. Trinh (School of Information and Communication Technology, Hanoi University of Science and Technology)
Loc Nguyen (School of Mathematics and Statistics, Victoria University of Wellington)
Phuong-Uyen Nguyen-Hoang (Computational Biology Center, International University - VNU HCMC)
Susanto Rahardja (School of Marine Science and Technology, Northwestern Polytechnical University)
Binh P. Nguyen (School of Mathematics and Statistics, Victoria University of Wellington)
URL: https://doi.org/10.1186/s12864-022-08829-6
Keywords: DNA; Transcription start site; Promoter; TATA-box; Bidirectional long short-term memory
title iPromoter-Seqvec: identifying promoters using bidirectional long short-term memory and sequence-embedded features
topic DNA
Transcription start site
Promoter
TATA-box
Bidirectional long short-term memory
url https://doi.org/10.1186/s12864-022-08829-6