DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data
Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Because universal gene markers and database representatives are lacking, functions cannot be assigned to about 50–90% of phage genes. This makes it challenging to identify phage genomes and annotate functio...
Main Authors: | Yunmeng Chu, Shun Guo, Dachao Cui, Xiongfei Fu, Yingfei Ma |
---|---|
Format: | Article |
Language: | English |
Published: | PeerJ Inc. 2022-06-01 |
Series: | PeerJ |
Subjects: | Convolutional neural network, Deep learning, Phage-specific protein, Phage, Metagenomics |
Online Access: | https://peerj.com/articles/13404.pdf |
_version_ | 1827607919955279872 |
---|---|
author | Yunmeng Chu; Shun Guo; Dachao Cui; Xiongfei Fu; Yingfei Ma |
author_facet | Yunmeng Chu; Shun Guo; Dachao Cui; Xiongfei Fu; Yingfei Ma |
author_sort | Yunmeng Chu |
collection | DOAJ |
description | Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Because universal gene markers and database representatives are lacking, functions cannot be assigned to about 50–90% of phage genes. This makes it challenging to identify phage genomes and annotate the functions of phage genes efficiently by large-scale homology search, especially for newly discovered phages. Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are three proteins specific to Caudovirales phages. Here, we developed a CNN (convolutional neural network)-based framework, DeephageTP, to identify these three proteins in metagenomic data. The framework takes one-hot encodings of the original protein sequences as input and automatically extracts predictive features during modeling. To reduce false positives, a cutoff-loss-value strategy is introduced, based on the distributions of the loss values of protein sequences within the same category. With a set of cutoff loss values, the proposed model achieves high precision in identifying TerL and Portal sequences (94% and 90%, respectively) in the mimic metagenomic dataset. Finally, we tested the framework on three real metagenomic datasets. The results showed that, compared with conventional alignment-based methods, the framework had a particular advantage in identifying novel Portal and TerL protein sequences with only remote homology to their counterparts in the training datasets. In summary, our study develops, for the first time, a CNN-based framework for identifying phage-specific protein sequences with high complexity and low conservation, and this framework will help find novel phages in metagenomic sequencing data. DeephageTP is available at https://github.com/chuym726/DeephageTP. |
first_indexed | 2024-03-09T07:02:43Z |
format | Article |
id | doaj.art-f120024be34549e9a02b0951ec7c819b |
institution | Directory of Open Access Journals |
issn | 2167-8359 |
language | English |
last_indexed | 2024-03-09T07:02:43Z |
publishDate | 2022-06-01 |
publisher | PeerJ Inc. |
record_format | Article |
series | PeerJ |
spelling | doaj.art-f120024be34549e9a02b0951ec7c819b; 2023-12-03T09:48:35Z; eng; PeerJ Inc.; PeerJ; 2167-8359; 2022-06-01; 10; e13404; 10.7717/peerj.13404; DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data; Yunmeng Chu (0), Shun Guo (1), Dachao Cui (2), Xiongfei Fu (3), Yingfei Ma (4); affiliation, identical for 0–4: Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese, Shenzhen, Guangdong, P.R. China; Bacteriophages (phages) are the most abundant and diverse biological entities on Earth. Because universal gene markers and database representatives are lacking, functions cannot be assigned to about 50–90% of phage genes. This makes it challenging to identify phage genomes and annotate the functions of phage genes efficiently by large-scale homology search, especially for newly discovered phages. Portal (portal protein), TerL (large terminase subunit protein), and TerS (small terminase subunit protein) are three proteins specific to Caudovirales phages. Here, we developed a CNN (convolutional neural network)-based framework, DeephageTP, to identify these three proteins in metagenomic data. The framework takes one-hot encodings of the original protein sequences as input and automatically extracts predictive features during modeling. To reduce false positives, a cutoff-loss-value strategy is introduced, based on the distributions of the loss values of protein sequences within the same category. With a set of cutoff loss values, the proposed model achieves high precision in identifying TerL and Portal sequences (94% and 90%, respectively) in the mimic metagenomic dataset. Finally, we tested the framework on three real metagenomic datasets. The results showed that, compared with conventional alignment-based methods, the framework had a particular advantage in identifying novel Portal and TerL protein sequences with only remote homology to their counterparts in the training datasets. In summary, our study develops, for the first time, a CNN-based framework for identifying phage-specific protein sequences with high complexity and low conservation, and this framework will help find novel phages in metagenomic sequencing data. DeephageTP is available at https://github.com/chuym726/DeephageTP.; https://peerj.com/articles/13404.pdf; Convolutional neural network, Deep learning, Phage-specific protein, Phage, Metagenomics |
spellingShingle | Yunmeng Chu Shun Guo Dachao Cui Xiongfei Fu Yingfei Ma DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data PeerJ Convolutional neural network Deep learning Phage-specific protein Phage Metagenomics |
title | DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data |
title_full | DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data |
title_fullStr | DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data |
title_full_unstemmed | DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data |
title_short | DeephageTP: a convolutional neural network framework for identifying phage-specific proteins from metagenomic sequencing data |
title_sort | deephagetp a convolutional neural network framework for identifying phage specific proteins from metagenomic sequencing data |
topic | Convolutional neural network; Deep learning; Phage-specific protein; Phage; Metagenomics |
url | https://peerj.com/articles/13404.pdf |
work_keys_str_mv | AT yunmengchu deephagetpaconvolutionalneuralnetworkframeworkforidentifyingphagespecificproteinsfrommetagenomicsequencingdata AT shunguo deephagetpaconvolutionalneuralnetworkframeworkforidentifyingphagespecificproteinsfrommetagenomicsequencingdata AT dachaocui deephagetpaconvolutionalneuralnetworkframeworkforidentifyingphagespecificproteinsfrommetagenomicsequencingdata AT xiongfeifu deephagetpaconvolutionalneuralnetworkframeworkforidentifyingphagespecificproteinsfrommetagenomicsequencingdata AT yingfeima deephagetpaconvolutionalneuralnetworkframeworkforidentifyingphagespecificproteinsfrommetagenomicsequencingdata |
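
The description field in this record mentions two mechanisms: one-hot encoding of protein sequences as CNN input, and a per-category cutoff on loss values to reduce false positives. The following is a minimal Python sketch of those two ideas, not the DeephageTP implementation. The 20-letter alphabet, the fixed `max_len`, the all-zero rows for unknown residues and padding, the cross-entropy loss, and the quantile rule in `loss_cutoff` are all assumptions made for illustration; the record does not state how the authors derived their cutoffs.

```python
import math

# Assumption: the 20 standard amino acids, in an arbitrary fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq, max_len):
    """Encode a protein sequence as a max_len x 20 matrix of 0/1 values.

    Sequences longer than max_len are truncated; shorter ones are padded
    with all-zero rows. Unknown residues (e.g. 'X') also become zero rows.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(seq.upper()[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


def cross_entropy_loss(probs, label):
    """Loss of one prediction with respect to a class: -log p(label)."""
    return -math.log(max(probs[label], 1e-12))


def loss_cutoff(known_losses, quantile=0.95):
    """Hypothetical cutoff rule: a quantile of the losses observed for
    sequences known to belong to one category."""
    if not known_losses:
        raise ValueError("need at least one loss value")
    ordered = sorted(known_losses)
    k = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[k]


def accept(probs, cutoffs):
    """Return the predicted class index if its loss is within that class's
    cutoff, otherwise None (the prediction is rejected)."""
    label = max(range(len(probs)), key=probs.__getitem__)
    loss = cross_entropy_loss(probs, label)
    return label if loss <= cutoffs[label] else None
```

In this sketch, a confident prediction such as `accept([0.9, 0.05, 0.05], [0.2, 0.2, 0.2])` is kept, while a less confident one such as `accept([0.5, 0.3, 0.2], [0.2, 0.2, 0.2])` is rejected. A tighter cutoff trades recall for precision.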