A big data approach to metagenomics for all-food-sequencing
Abstract. Background: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires...
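Main Authors: | Robin Kobus, José M. Abuín, André Müller, Sören Lukas Hellmann, Juan C. Pichel, Tomás F. Pena, Andreas Hildebrandt, Thomas Hankeln, Bertil Schmidt |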
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2020-03-01 |
Series: | BMC Bioinformatics |
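Subjects: | Next-generation sequencing; Metagenomics; Species identification; Eukaryotic genomes; Locality sensitive hashing; Big data |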
Online Access: | http://link.springer.com/article/10.1186/s12859-020-3429-6 |
_version_ | 1830382386777423872 |
---|---|
author | Robin Kobus; José M. Abuín; André Müller; Sören Lukas Hellmann; Juan C. Pichel; Tomás F. Pena; Andreas Hildebrandt; Thomas Hankeln; Bertil Schmidt |
author_sort | Robin Kobus |
collection | DOAJ |
description | Abstract. Background: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing number of available reference genomes establishes the need for efficient big data approaches. Results: We introduce an alignment-free k-mer-based method for detection and quantification of species composition in food and other complex biological matters. It is orders of magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken, it is superior in terms of false-positive rate and quantification accuracy. Furthermore, an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). Conclusions: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach, it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters). |
first_indexed | 2024-12-20T10:04:36Z |
format | Article |
id | doaj.art-52b8dda9f3684cd6a38d2bbc7c876b0f |
institution | Directory Open Access Journal |
issn | 1471-2105 |
language | English |
last_indexed | 2024-12-20T10:04:36Z |
publishDate | 2020-03-01 |
publisher | BMC |
record_format | Article |
series | BMC Bioinformatics |
spelling | doaj.art-52b8dda9f3684cd6a38d2bbc7c876b0f; 2022-12-21T19:44:15Z; eng; BMC; BMC Bioinformatics; ISSN 1471-2105; 2020-03-01; vol. 21, no. 1, pp. 1-15; DOI 10.1186/s12859-020-3429-6; A big data approach to metagenomics for all-food-sequencing. Authors and affiliations: Robin Kobus (Department of Computer Science, Johannes Gutenberg University); José M. Abuín (IPCA, Polytechnic Institute of Cávado and Ave); André Müller (Department of Computer Science, Johannes Gutenberg University); Sören Lukas Hellmann (Molecular Genetics and Genome Analysis, Institute of Organismal and Molecular Evolution, Johannes Gutenberg University); Juan C. Pichel (CiTIUS, Universidade de Santiago de Compostela); Tomás F. Pena (CiTIUS, Universidade de Santiago de Compostela); Andreas Hildebrandt (Department of Computer Science, Johannes Gutenberg University); Thomas Hankeln (Molecular Genetics and Genome Analysis, Institute of Organismal and Molecular Evolution, Johannes Gutenberg University); Bertil Schmidt (Department of Computer Science, Johannes Gutenberg University). URL: http://link.springer.com/article/10.1186/s12859-020-3429-6. Keywords: Next-generation sequencing; Metagenomics; Species identification; Eukaryotic genomes; Locality sensitive hashing; Big data |
title | A big data approach to metagenomics for all-food-sequencing |
topic | Next-generation sequencing; Metagenomics; Species identification; Eukaryotic genomes; Locality sensitive hashing; Big data |
url | http://link.springer.com/article/10.1186/s12859-020-3429-6 |
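
The abstract describes an alignment-free, k-mer-based method and the record lists locality sensitive hashing among its topics. As a rough illustration of that general idea only, the following minimal Python sketch builds a min-hash index over reference windows and assigns a read to the reference that shares the most sketch hashes with it. It is not the AFS-MetaCache or MetaCacheSpark implementation: the function names, the parameters `k`, `s`, and `window`, the choice of BLAKE2b as hash function, and the toy sequences are all assumptions made for this example.

```python
import hashlib
from collections import Counter


def kmer_hashes(seq, k):
    """Hash every k-mer of seq to a 64-bit integer (BLAKE2b is an arbitrary choice)."""
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def sketch(seq, k=8, s=4):
    """Bottom-s min-hash sketch: the s smallest k-mer hash values of seq."""
    return frozenset(sorted(kmer_hashes(seq, k))[:s])


def build_index(references, k=8, s=4, window=16):
    """Map each sketch hash to the set of reference names whose windows contain it."""
    index = {}
    step = window - k + 1  # consecutive windows overlap by k - 1 bases
    for name, genome in references.items():
        for start in range(0, max(len(genome) - k + 1, 1), step):
            for h in sketch(genome[start:start + window], k, s):
                index.setdefault(h, set()).add(name)
    return index


def classify(read, index, k=8, s=4):
    """Return the reference with the most sketch-hash hits, or None if there are none."""
    votes = Counter()
    for h in sketch(read, k, s):
        for name in index.get(h, ()):
            votes[name] += 1
    return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    # Toy references (hypothetical data); real reference genomes are far larger.
    refs = {
        "species_a": "AACACCACAAACCCACACCAACAACCCACCAA",
        "species_b": "GGTGTTGTGGGTTTGTGTTGGTGGTTTGTTGG",
    }
    idx = build_index(refs)
    print(classify(refs["species_a"][:16], idx))
```

Only a few hashes per window are stored, so the index stays much smaller than the full k-mer set; that trade-off is the usual motivation for sketching, and a partitioned or distributed index would build on the same lookup structure.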