Data Quality Automation: a Generic Approach for Large Linked Research Datasets

Introduction When datasets are collected mainly for administrative rather than research purposes, data quality checks are necessary to ensure robust findings and to avoid biased results due to incomplete or inaccurate data. When done manually, data quality checks are time-consuming. We introduced...

Bibliographic Details
Main Authors: Muhammad A Elmessary, Daniel Thayer, Sarah Rees, Leticia ReesKemp, Arfon Rees
Format: Article
Language: English
Published: Swansea University 2018-09-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/1000
collection DOAJ
description
Introduction: When datasets are collected mainly for administrative rather than research purposes, data quality checks are necessary to ensure robust findings and to avoid biased results due to incomplete or inaccurate data. When done manually, data quality checks are time-consuming. We introduced automation to speed up the process and save effort.
Objectives and Approach: We have devised a set of automated generic quality checks and reporting, which can be run on any dataset in a relational database without any dataset-specific knowledge or configuration. The code is written in Python. Checks include, among others, linkage quality, agreement with a population data source, comparison with the previous data version, duplication checks, null counts, and value distribution and range. Where dataset metadata is available, checks for validity against lookup tables are included, and the output report includes documentation on data contents. An HTML report with dynamic datatables and interactive graphs, allowing easy exploration of the results, is produced using RMarkdown.
Results: The automated generic data quality check provides a quick and easy way to report on data issues with minimal effort. It allows comparison with reference tables, lookups and previous versions of the same table to highlight differences. Moreover, the tool can be given to researchers as a means of gaining a more detailed understanding of their data. While other research data quality tools exist, this tool is distinguished by its features specific to linked data research, as well as by its implementation in a relational database environment. It has been successfully tested on datasets of over two billion rows. The tool was designed for use within the SAIL Databank, but could easily be adapted and used in other settings.
Conclusion/Implications: The effort spent on automating generic testing and reporting on the data quality of research datasets is more than compensated by its outputs. Benefits include quick detection and scrutiny of many sources of invalid and incomplete data. This process can easily be expanded to accommodate more standard tests.
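The record does not reproduce the authors' code, so the following is only a minimal sketch of what "generic" checks of this kind might look like. It assumes an SQLite database accessed through Python's standard `sqlite3` module (the abstract does not name the database engine), and the names `quote_ident`, `profile_table` and the `admissions` table are hypothetical. It covers only null counts, distinct values, value range and fully duplicated rows; linkage quality, population agreement, version comparison and lookup validation are not shown.

```python
import sqlite3


def quote_ident(name: str) -> str:
    """Quote an SQL identifier so arbitrary table/column names can be embedded."""
    return '"' + name.replace('"', '""') + '"'


def profile_table(conn: sqlite3.Connection, table: str) -> dict:
    """Run generic checks using only the schema the database itself reports.

    Hypothetical helper: no dataset-specific configuration is passed in.
    """
    t = quote_ident(table)
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({t})")]
    total = conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]

    report = {"table": table, "rows": total, "columns": {}}
    for col in columns:
        c = quote_ident(col)
        nulls, distinct, lo, hi = conn.execute(
            f"SELECT SUM({c} IS NULL), COUNT(DISTINCT {c}), MIN({c}), MAX({c}) "
            f"FROM {t}"
        ).fetchone()
        report["columns"][col] = {
            "nulls": nulls or 0,  # SUM over an empty table returns NULL
            "distinct": distinct,  # COUNT(DISTINCT ...) ignores NULLs
            "min": lo,
            "max": hi,
        }

    # Fully duplicated rows: total rows minus the number of distinct rows.
    col_list = ", ".join(quote_ident(c) for c in columns)
    distinct_rows = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {col_list} FROM {t})"
    ).fetchone()[0]
    report["duplicate_rows"] = total - distinct_rows
    return report


# Illustrative data only; the table and its values are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (person_id INTEGER, admit_date TEXT, ward TEXT)")
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?)",
    [
        (1, "2017-01-05", "A"),
        (2, None, "B"),
        (2, None, "B"),
        (3, "2017-03-09", None),
    ],
)
report = profile_table(conn, "admissions")
print(report)
```

Because the column list is read from the database catalogue rather than supplied by the caller, the same function can be pointed at any table. A production version for very large tables would likely push all per-column aggregates into a single query rather than scanning once per column.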
first_indexed 2024-03-09T07:55:53Z
id doaj.art-a003588ea5564167a7eec3c9fce932cf
institution Directory Open Access Journal
issn 2399-4908
last_indexed 2024-03-09T07:55:53Z
spelling doaj.art-a003588ea5564167a7eec3c9fce932cf, 2023-12-03T01:08:18Z, eng, Swansea University, International Journal of Population Data Science, ISSN 2399-4908, 2018-09-01, vol. 3, no. 4, doi:10.23889/ijpds.v3i4.1000, article 1000. Author affiliations: Muhammad A Elmessary, Daniel Thayer, Sarah Rees, Leticia ReesKemp, Arfon Rees (all Swansea University).