Contigs directed gene annotation (ConDiGA) for accurate protein sequence database construction in metaproteomics

Abstract. Background: Microbiota are closely associated with human health and disease. Metaproteomics can provide a direct means to identify microbial proteins in microbiota for compositional and functional characterization. However, in-depth and accurate metaproteomics is still limited due to the ext...


Bibliographic Details
Main Authors:	Enhui Wu, Vijini Mallawaarachchi, Jinzhi Zhao, Yi Yang, Hebin Liu, Xiaoqing Wang, Chengpin Shen, Yu Lin, Liang Qiao
Format:	Article
Language:	English
Published:	BMC 2024-03-01
Series:	Microbiome
Subjects:	Taxonomic annotation; Metaproteomics; Metagenomics; Microbiota; Mass spectrometry
Online Access:	https://doi.org/10.1186/s40168-024-01775-3

author	Enhui Wu, Vijini Mallawaarachchi, Jinzhi Zhao, Yi Yang, Hebin Liu, Xiaoqing Wang, Chengpin Shen, Yu Lin, Liang Qiao
author_sort	Enhui Wu
collection	DOAJ
description	Abstract. Background: Microbiota are closely associated with human health and disease. Metaproteomics can provide a direct means to identify microbial proteins in microbiota for compositional and functional characterization. However, in-depth and accurate metaproteomics is still limited by the extreme complexity and high diversity of microbiota samples. It is generally recommended to use metagenomic data from the same samples to construct the protein sequence database for metaproteomic data analysis. Although different metagenomics-based database construction strategies have been developed, optimization of gene taxonomic annotation, which is extremely important for accurate metaproteomic analysis, has not been reported. Results: Herein, we proposed an accurate taxonomic annotation pipeline for genes from metagenomic data, namely contigs directed gene annotation (ConDiGA), and used the method to build a protein sequence database for metaproteomic analysis. We compared our pipeline (ConDiGA, or MD3) with two other popular annotation pipelines (MD1 and MD2). In MD1, genes were annotated directly against the whole bacterial genome database; in MD2, contigs were annotated against the whole bacterial genome database and the taxonomic information of each contig was assigned to its genes; in MD3, the most confident species from the contig annotation results were taken as the reference for annotating genes. Annotation tools, including BLAST, Kaiju, and Kraken2, were compared. On a synthetic microbial community of 12 species, Kaiju with the MD3 pipeline outperformed the others in constructing a protein sequence database from metagenomic data. Similar performance was also observed with a fecal sample, as well as with in silico mixed datasets of the simulated microbial community and the fecal sample. Conclusions: Overall, we developed an optimized pipeline for gene taxonomic annotation to construct protein sequence databases. Our study can tackle the current taxonomic annotation reliability problem in metagenomics-derived protein sequence databases and can promote in-depth metaproteomic analysis of the microbiome. The unique metagenomic and metaproteomic datasets of the 12 bacterial species are publicly available as a standard benchmarking sample for evaluating various analysis pipelines. The code of ConDiGA is open access on GitHub for the analysis of microbiota samples.
first_indexed	2024-04-24T19:53:38Z
format	Article
id	doaj.art-3a43e16a5ddc479e9f90bb7b10f22577
institution	Directory Open Access Journal
issn	2049-2618
language	English
last_indexed	2024-04-24T19:53:38Z
publishDate	2024-03-01
publisher	BMC
record_format	Article
series	Microbiome
spelling	doaj.art-3a43e16a5ddc479e9f90bb7b10f22577; 2024-03-24T12:27:39Z; eng; BMC; Microbiome; 2049-2618; 2024-03-01; volume 12, issue 1, pages 1-14; 10.1186/s40168-024-01775-3; Contigs directed gene annotation (ConDiGA) for accurate protein sequence database construction in metaproteomics; Enhui Wu (Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University); Vijini Mallawaarachchi (School of Computing, College of Engineering, Computing and Cybernetics, The Australian National University); Jinzhi Zhao (Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University); Yi Yang (Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University); Hebin Liu (Shanghai Omicsolution Co., Ltd); Xiaoqing Wang (Shanghai Omicsolution Co., Ltd); Chengpin Shen (Shanghai Omicsolution Co., Ltd); Yu Lin (School of Computing, College of Engineering, Computing and Cybernetics, The Australian National University); Liang Qiao (Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University); https://doi.org/10.1186/s40168-024-01775-3; Taxonomic annotation; Metaproteomics; Metagenomics; Microbiota; Mass spectrometry
title	Contigs directed gene annotation (ConDiGA) for accurate protein sequence database construction in metaproteomics
topic	Taxonomic annotation; Metaproteomics; Metagenomics; Microbiota; Mass spectrometry
url	https://doi.org/10.1186/s40168-024-01775-3


Similar Items