Reproducible big data science: A case study in continuous FAIRness.
Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle.
Main Authors: | Ravi Madduri, Kyle Chard, Mike D'Arcy, Segun C Jung, Alexis Rodriguez, Dinanath Sulakhe, Eric Deutsch, Cory Funk, Ben Heavner, Matthew Richards, Paul Shannon, Gustavo Glusman, Nathan Price, Carl Kesselman, Ian Foster |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2019-01-01 |
Series: | PLoS ONE |
Online Access: | https://doi.org/10.1371/journal.pone.0213013 |
---|---|
author | Ravi Madduri; Kyle Chard; Mike D'Arcy; Segun C Jung; Alexis Rodriguez; Dinanath Sulakhe; Eric Deutsch; Cory Funk; Ben Heavner; Matthew Richards; Paul Shannon; Gustavo Glusman; Nathan Price; Carl Kesselman; Ian Foster |
collection | DOAJ |
description | Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes. |
id | doaj.art-cf14101c167242bcb376929895f44270 |
institution | Directory Open Access Journal |
issn | 1932-6203 |