Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package

Bibliographic Details
Main Authors: Alexander G. Hurley, Richard L. Peters, Christoforos Pappas, David N. Steger, Ingo Heinrich
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2022-01-01
Series: PLoS ONE
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9098071/?tool=EBI
collection DOAJ
description Ecological research, like all Earth System Sciences, is becoming increasingly data-rich. Tools for processing “big data” are continuously developed to meet the corresponding technical and logistical challenges. However, even at smaller scales, data sets may be challenging when best practices in data exploration, quality control and reproducibility are to be met. This can occur when conventional methods, such as generating and assessing diagnostic visualizations or tables, become unfeasible due to time and practicality constraints. Interactive processing can alleviate this issue, and is increasingly utilized to ensure that large data sets are diligently handled. However, recent interactive tools rarely enable data manipulation, may not generate reproducible outputs, or are typically data/domain-specific. We developed datacleanr, an interactive tool that facilitates best practices in data exploration, quality control (e.g., outlier assessment) and flexible processing for multiple tabular data types, including time series and georeferenced data. The package is open-source, and based on the R programming language. A key functionality of datacleanr is the “reproducible recipe”: a translation of all interactive actions into R code, which can be integrated into existing analysis pipelines. This enables researchers experienced with script-based workflows to utilize the strengths of interactive processing without sacrificing their usual work style or functionalities from other (R) packages. We demonstrate the package’s utility by addressing two common issues during data analyses, namely 1) identifying problematic structures and artefacts in hierarchically nested data, and 2) preventing excessive loss of data from ‘coarse,’ code-based filtering of time series. Ultimately, with datacleanr we aim to improve researchers’ workflows and increase confidence in and reproducibility of their results.
issn 1932-6203