Estimating the total genome length of a metagenomic sample using k-mers

Abstract. Background: Metagenomic sequencing is a powerful technology for studying mixtures of microbes, or microbiomes, on humans and in the environment. One basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging due to the complex...

Bibliographic Details
Main Authors:	Kui Hua, Xuegong Zhang
Format:	Article
Language:	English
Published:	BMC 2019-04-01
Series:	BMC Genomics
Subjects:	Metagenomics Sequencing coverage Distinct k-mers Genome length
Online Access:	http://link.springer.com/article/10.1186/s12864-019-5467-x

collection	DOAJ
description	Abstract. Background: Metagenomic sequencing is a powerful technology for studying mixtures of microbes, or microbiomes, on humans and in the environment. One basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging because of the complexity of microbiome composition, the limited availability of known reference genomes, and usually insufficient sequencing coverage. Results: As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in a metagenomic sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (KRI) to bridge the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulation data corresponding to multiple situations of true metagenomic data. Results on real data indicate that the uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses. Conclusions: We posed the question of what the total genome length of all the different species in a microbial community is, and introduced a method to answer it.
first_indexed	2024-12-21T15:15:19Z
format	Article
id	doaj.art-3093d06f52d441bb894a300023ae07a1
institution	Directory Open Access Journal
issn	1471-2164
language	English
last_indexed	2024-12-21T15:15:19Z
publishDate	2019-04-01
publisher	BMC
record_format	Article
series	BMC Genomics
spelling	doaj.art-3093d06f52d441bb894a300023ae07a1; 2022-12-21T18:59:11Z; eng; BMC; BMC Genomics; 1471-2164; 2019-04-01; 20; S2; 93; 101; 10.1186/s12864-019-5467-x; Estimating the total genome length of a metagenomic sample using k-mers; Kui Hua (MOE Key Laboratory of Bioinformatics Division and Center for Synthetic & System Biology, BNRIST); Xuegong Zhang (MOE Key Laboratory of Bioinformatics Division and Center for Synthetic & System Biology, BNRIST); http://link.springer.com/article/10.1186/s12864-019-5467-x; Metagenomics; Sequencing coverage; Distinct k-mers; Genome length
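
The description above says the total genome length is approached through the number of distinct k-mers and the sequencing coverage distribution of the observed k-mers. The Python sketch below only illustrates those two raw quantities on toy reads; it is not the authors' estimator and does not compute the KRI. The function names (kmer_counts, coverage_histogram), the toy reads, and k=4 are assumptions made for illustration, and reverse complements are deliberately not merged.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer exactly as written in the reads (hypothetical helper)."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(counts):
    """Map each multiplicity to the number of distinct k-mers observed that often."""
    return Counter(counts.values())


# Toy reads, invented for this illustration only.
reads = ["ACGTAC", "CGTACG", "TTTTAC"]
counts = kmer_counts(reads, k=4)

distinct_observed = len(counts)          # distinct k-mers actually seen
histogram = coverage_histogram(counts)   # coverage distribution of those k-mers
```

The count of observed distinct k-mers is only a lower bound on the number present in the sample, because k-mers from poorly covered genomes may never be sequenced; an estimator of the kind the description mentions would work from the histogram to account for the unseen ones.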

Similar Items