Data Validation Infrastructure for R

Checking data quality against domain knowledge is a common activity that pervades statistical analysis from raw data to output. The R package validate facilitates this task by capturing and applying expert knowledge in the form of validation rules: logical restrictions on variables, records, or data...

Bibliographic Details
Main Authors: Mark P. J. van der Loo, Edwin de Jonge
Format: Article
Language: English
Published: Foundation for Open Access Statistics 2021-03-01
Series: Journal of Statistical Software
Subjects: data checking, data quality, data cleaning, R
Online Access: https://www.jstatsoft.org/index.php/jss/article/view/3483
collection DOAJ
description Checking data quality against domain knowledge is a common activity that pervades statistical analysis from raw data to output. The R package validate facilitates this task by capturing and applying expert knowledge in the form of validation rules: logical restrictions on variables, records, or data sets that should be satisfied before they are considered valid input for further analysis. In the validate package, validation rules are objects of computation that can be manipulated, investigated, and confronted with data or versions of a data set. The results of a confrontation are then available for further investigation, summarization or visualization. Validation rules can also be endowed with metadata and documentation and they may be stored or retrieved from external sources such as text files or tabular formats. This data validation infrastructure thus allows for systematic, user-defined definition of data quality requirements that can be reused for various versions of a data set or by data correction algorithms that are parameterized by validation rules.
first_indexed 2024-03-13T08:00:16Z
id doaj.art-91379f4791884e50b2dc2da72268f8bc
institution Directory of Open Access Journals
issn 1548-7660
last_indexed 2024-03-13T08:00:16Z
spelling doaj.art-91379f4791884e50b2dc2da72268f8bc | 2023-06-01T18:41:10Z | eng | Foundation for Open Access Statistics | Journal of Statistical Software | 1548-7660 | 2021-03-01 | 97 | 1 | 10.18637/jss.v097.i10 | 3332 | Data Validation Infrastructure for R | Mark P. J. van der Loo | Edwin de Jonge | https://www.jstatsoft.org/index.php/jss/article/view/3483 | data checking | data quality | data cleaning | R
topic data checking
data quality
data cleaning
R