DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data

Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Owing to the lack of universal gene markers and database representatives, about 50–90% of phage genes cannot be assigned functions. This makes it challenging to identify phage genomes and annotate functio...

Bibliographic Details
Main Authors: Yunmeng Chu, Shun Guo, Dachao Cui, Xiongfei Fu, Yingfei Ma
Format: Article
Language: English
Published: PeerJ Inc. 2022-06-01
Series: PeerJ
Subjects: Convolutional neural network, Deep learning, Phage-specific protein, Phage, Metagenomics
Online Access: https://peerj.com/articles/13404.pdf
_version_ 1827607919955279872
author Yunmeng Chu
Shun Guo
Dachao Cui
Xiongfei Fu
Yingfei Ma
collection DOAJ
description Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Owing to the lack of universal gene markers and database representatives, about 50–90% of phage genes cannot be assigned functions. This makes it challenging to identify phage genomes and annotate the functions of phage genes efficiently by homology search on a large scale, especially for newly discovered phages. Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are three proteins specific to Caudovirales phages. Here, we developed a CNN (convolutional neural network)-based framework, DeephageTP, to identify these three proteins in metagenomic data. The framework takes one-hot encodings of the original protein sequences as input and automatically extracts predictive features during modeling. To overcome the false-positive problem, a cutoff-loss-value strategy is introduced based on the distributions of the loss values of protein sequences within the same category. The proposed model with a set of cutoff loss values achieves high precision in identifying TerL and Portal sequences (94% and 90%, respectively) from the mimic metagenomic dataset. Finally, we tested the efficacy of the framework using three real metagenomic datasets. The results showed that, compared to conventional alignment-based methods, our proposed framework had a particular advantage in identifying novel phage-specific Portal and TerL protein sequences with remote homology to their counterparts in the training datasets. In summary, our study develops, for the first time, a CNN-based framework for identifying phage-specific protein sequences with high complexity and low conservation, and this framework will help find novel phages in metagenomic sequencing data. DeephageTP is available at https://github.com/chuym726/DeephageTP.
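The two mechanisms named in the description, one-hot encoding of protein sequences and filtering predictions by a per-class loss cutoff, can be illustrated with a minimal Python sketch. It is not taken from the DeephageTP repository. The amino acid alphabet and its ordering, the padding length, the function names, the use of the negative log of the top class probability as the loss value, and the cutoff numbers are all assumptions made for illustration.

```python
import math

# Assumption: the 20 standard amino acids in alphabetical one-letter order;
# the alphabet and ordering used by DeephageTP may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix; unknown residues and padding stay all-zero."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence[:max_len].upper()):
        idx = AA_INDEX.get(residue)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


def loss_for_predicted_class(probabilities):
    """Return the top-scoring label and its loss, taken here as -log(probability)."""
    label = max(probabilities, key=probabilities.get)
    return label, -math.log(probabilities[label])


def apply_loss_cutoff(probabilities, cutoffs):
    """Keep the predicted label only if its loss is at or below that class's cutoff."""
    label, loss = loss_for_predicted_class(probabilities)
    # Classes without a cutoff are never filtered out in this sketch.
    return label if loss <= cutoffs.get(label, math.inf) else None


# Hypothetical cutoffs; in the paper they are derived from per-category loss distributions.
cutoffs = {"TerL": 0.05, "Portal": 0.05, "TerS": 0.05}
encoded = one_hot_encode("MKT", max_len=5)
confident = apply_loss_cutoff({"TerL": 0.98, "other": 0.02}, cutoffs)
uncertain = apply_loss_cutoff({"TerL": 0.60, "other": 0.40}, cutoffs)
```

In this sketch a TerL call with probability 0.98 is kept, while one with probability 0.60 is rejected because its loss exceeds the cutoff. That is the sense in which a loss cutoff trades recall for fewer false positives.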
first_indexed 2024-03-09T07:02:43Z
format Article
id doaj.art-f120024be34549e9a02b0951ec7c819b
institution Directory of Open Access Journals
issn 2167-8359
language English
last_indexed 2024-03-09T07:02:43Z
publishDate 2022-06-01
publisher PeerJ Inc.
record_format Article
series PeerJ
spelling doaj.art-f120024be34549e9a02b0951ec7c819b
2023-12-03T09:48:35Z
eng
PeerJ Inc.
PeerJ
2167-8359
2022-06-01
Volume 10, article e13404
DOI 10.7717/peerj.13404
DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data
Yunmeng Chu, Shun Guo, Dachao Cui, Xiongfei Fu, Yingfei Ma
Affiliation (all five authors): Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese, Shenzhen, Guangdong, P.R. China
https://peerj.com/articles/13404.pdf
Convolutional neural network
Deep learning
Phage-specific protein
Phage
Metagenomics
title DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data
topic Convolutional neural network
Deep learning
Phage-specific protein
Phage
Metagenomics
url https://peerj.com/articles/13404.pdf