Reproducible big data science: A case study in continuous FAIRness.

Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.

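The tools described in the abstract are not reproduced in this record, but the FAIR "findable and accessible" principles it invokes can be illustrated against the record itself: the article's DOI is a persistent, machine-actionable identifier. The sketch below is illustrative only and not code from the paper; it assumes Python 3 with the requests library, and resolves the DOI to structured citation metadata using standard DOI content negotiation.

    import requests

    # Persistent identifier for this article, taken from the record below.
    DOI_URL = "https://doi.org/10.1371/journal.pone.0213013"

    # DOI resolvers support content negotiation: requesting CSL JSON returns
    # machine-readable citation metadata rather than the publisher landing page.
    response = requests.get(
        DOI_URL,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=30,
    )
    response.raise_for_status()
    metadata = response.json()

    print(metadata["title"])            # article title
    print(metadata["container-title"])  # journal name
    print([a["family"] for a in metadata["author"]])  # author surnames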

Bibliographic Details
Main Authors: Ravi Madduri, Kyle Chard, Mike D'Arcy, Segun C Jung, Alexis Rodriguez, Dinanath Sulakhe, Eric Deutsch, Cory Funk, Ben Heavner, Matthew Richards, Paul Shannon, Gustavo Glusman, Nathan Price, Carl Kesselman, Ian Foster
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2019-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0213013
Collection: DOAJ (Directory of Open Access Journals)
ISSN: 1932-6203
Citation: PLoS ONE 14(4): e0213013, 2019. doi:10.1371/journal.pone.0213013