ReUseData: an R/Bioconductor tool for reusable and reproducible genomic data management

Bibliographic Details
Main Authors: Qian Liu, Qiang Hu, Song Liu, Alan Hutson, Martin Morgan
Format: Article
Language: English
Published: BMC 2024-01-01
Series: BMC Bioinformatics
Subjects: Genomic data; Data reusability; Data reproducibility; Data management; Common Workflow Language
Online Access: https://doi.org/10.1186/s12859-023-05626-0
_version_ 1797362853814992896
author Qian Liu
Qiang Hu
Song Liu
Alan Hutson
Martin Morgan
author_sort Qian Liu
collection DOAJ
description Abstract. Background: The increasing volume and complexity of genomic data pose significant challenges for effective data management and reuse. Public genomic data often undergo similar preprocessing across projects, leading to redundant or inconsistent datasets and inefficient use of computing resources. This is especially pertinent for bioinformaticians engaged in multiple projects. Tools have been created to address challenges in managing and accessing curated genomic datasets; however, such tools are most useful to users who work with specific types of data or who prefer a particular programming language. Currently, no R-specific solution exists for efficient data management and versatile data reuse. Results: Here we present ReUseData, an R software tool that overcomes some of the limitations of existing solutions and provides a versatile and reproducible approach to effective data management within R. ReUseData facilitates the transformation of ad hoc scripts for data preprocessing into Common Workflow Language (CWL)-based data recipes, allowing for the reproducible generation of curated data files in their generic formats. The data recipes are standardized and self-contained, making them portable and reproducible across various computing platforms. ReUseData also streamlines the reuse of curated data files and their integration into downstream analysis tools and workflows with different frameworks. Conclusions: ReUseData provides a reliable and reproducible approach for genomic data management within the R environment to enhance the accessibility and reusability of genomic data. The package is available at Bioconductor (https://bioconductor.org/packages/ReUseData/), with additional information on the project website (https://rcwl.org/dataRecipes/).
first_indexed 2024-03-08T16:12:22Z
format Article
id doaj.art-75d97b47effe42c9a0fd7ad5d4c63929
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-03-08T16:12:22Z
publishDate 2024-01-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-75d97b47effe42c9a0fd7ad5d4c63929
2024-01-07T12:49:35Z
eng
BMC
BMC Bioinformatics
1471-2105
2024-01-01
25119
10.1186/s12859-023-05626-0
ReUseData: an R/Bioconductor tool for reusable and reproducible genomic data management
Qian Liu, Qiang Hu, Song Liu, Alan Hutson, Martin Morgan (all: Department of Biostatistics and Bioinformatics, Roswell Park Comprehensive Cancer Center)
https://doi.org/10.1186/s12859-023-05626-0
Genomic data; Data reusability; Data reproducibility; Data management; Common Workflow Language
title ReUseData: an R/Bioconductor tool for reusable and reproducible genomic data management
topic Genomic data
Data reusability
Data reproducibility
Data management
Common Workflow Language
url https://doi.org/10.1186/s12859-023-05626-0