Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance

Bibliographic Details
Main Authors: Ruth E. Timme, Hugh Rand, Martin Shumway, Eija K. Trees, Mustafa Simmons, Richa Agarwala, Steven Davis, Glenn E. Tillman, Stephanie Defibaugh-Chavez, Heather A. Carleton, William A. Klimke, Lee S. Katz
Format: Article
Language: English
Published: PeerJ Inc. 2017-10-01
Series: PeerJ
Subjects: Benchmark datasets, Phylogenomics, Food safety, Foodborne outbreak, Salmonella, Listeria
Online Access: https://peerj.com/articles/3893.pdf
collection DOAJ
description Background: As next-generation sequencing technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationships in public health. Most new programs skip the traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes and then summarizing the results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so that they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines.
Methods: We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed the sequence data, sample metadata, and “known” phylogenetic trees in publicly accessible databases and developed a standard spreadsheet format for describing each dataset. To make these benchmarks easy to download, we developed an automated script that reads this spreadsheet format.
Results: Our “outbreak” benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni); a fifth, simulated dataset is one where the “known tree” can accurately be called the “true tree”. The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets.
Discussion: These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools. We welcome additional benchmark datasets in our recommended format and, if relevant, will add them to our GitHub site. Together, these datasets, the dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.
first_indexed 2024-03-09T07:06:25Z
format Article
id doaj.art-898bd13fc2f14f64a0bc11fefdc84e24
institution Directory Open Access Journal
issn 2167-8359
language English
last_indexed 2024-03-09T07:06:25Z
publishDate 2017-10-01
publisher PeerJ Inc.
record_format Article
series PeerJ
doi 10.7717/peerj.3893
volume 5
article_number e3893
author_affiliation Ruth E. Timme: Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, United States of America
Hugh Rand: Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, United States of America
Martin Shumway: National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, United States of America
Eija K. Trees: Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
Mustafa Simmons: Food Safety and Inspection Service, US Department of Agriculture, Athens, GA, United States of America
Richa Agarwala: National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, United States of America
Steven Davis: Center for Food Safety and Applied Nutrition, US Food and Drug Administration, College Park, MD, United States of America
Glenn E. Tillman: Food Safety and Inspection Service, US Department of Agriculture, Athens, GA, United States of America
Stephanie Defibaugh-Chavez: Food Safety and Inspection Service, US Department of Agriculture, Washington, D.C., United States of America
Heather A. Carleton: Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
William A. Klimke: National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, United States of America
Lee S. Katz: Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, GA, United States of America
title Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance
topic Benchmark datasets
Phylogenomics
Food safety
Foodborne outbreak
Salmonella
Listeria
url https://peerj.com/articles/3893.pdf