Data Quality Automation: a Generic Approach for Large Linked Research Datasets
Main Authors: | Muhammad A Elmessary, Daniel Thayer, Sarah Rees, Leticia ReesKemp, Arfon Rees |
---|---|
Format: | Article |
Language: | English |
Published: | Swansea University, 2018-09-01 |
Series: | International Journal of Population Data Science |
Online Access: | https://ijpds.org/article/view/1000 |
_version_ | 1827610733198704640 |
---|---|
author | Muhammad A Elmessary Daniel Thayer Sarah Rees Leticia ReesKemp Arfon Rees |
author_facet | Muhammad A Elmessary Daniel Thayer Sarah Rees Leticia ReesKemp Arfon Rees |
author_sort | Muhammad A Elmessary |
collection | DOAJ |
description | Introduction
When datasets are collected mainly for administrative rather than research purposes, data quality checks are necessary to ensure robust findings and to avoid biased results due to incomplete or inaccurate data.
When done manually, data quality checks are time-consuming. We introduced automation to speed up the process and save effort.
Objectives and Approach
We have devised a set of automated generic quality checks and reporting, which can be run on any dataset in a relational database without any dataset-specific knowledge or configuration.
The code is written in Python. Checks include linkage quality, agreement with a population data source, comparison with the previous data version, duplication checks, null counts, and value distribution and range.
Where dataset metadata is available, checks for validity against lookup tables are included, and the output report includes documentation on data contents. An HTML report with dynamic datatables and interactive graphs, allowing easy exploration of the results, is produced using RMarkdown.
Results
Automating the generic data quality checks provides a quick and easy tool for reporting on data issues with minimal effort. It allows comparison with reference tables, lookups and previous versions of the same table to highlight differences. Moreover, the tool can be provided to researchers as a means of gaining a more detailed understanding of their data.
While other research data quality tools exist, this tool is distinguished by its features specific to linked data research, as well as implementation in a relational database environment. It has been successfully tested on datasets of over two billion rows.
The tool was designed for use within the SAIL Databank, but could easily be adapted and used in other settings.
Conclusion/Implications
The effort spent on automating generic testing and reporting on the data quality of research datasets is more than compensated for by its outputs. Benefits include quick detection and scrutiny of many sources of invalid and incomplete data. The process can easily be expanded to accommodate more standard tests. |
first_indexed | 2024-03-09T07:55:53Z |
format | Article |
id | doaj.art-a003588ea5564167a7eec3c9fce932cf |
institution | Directory Open Access Journal |
issn | 2399-4908 |
language | English |
last_indexed | 2024-03-09T07:55:53Z |
publishDate | 2018-09-01 |
publisher | Swansea University |
record_format | Article |
series | International Journal of Population Data Science |
title | Data Quality Automation: a Generic Approach for Large Linked Research Datasets |
url | https://ijpds.org/article/view/1000 |
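
Several of the generic checks named in the abstract (null count, duplication, value range, and validity against a lookup table) can be illustrated with a short sketch. This is not the authors' code: the record does not publish the implementation, so the function names `profile_table` and `count_invalid_codes` are hypothetical, and SQLite (via Python's standard `sqlite3` module) is assumed only as a stand-in for whichever relational database is actually used. The point it shows is that the column list is read from the database's own catalogue at run time, so no dataset-specific configuration is needed.

```python
import sqlite3


def quote_ident(name: str) -> str:
    """Quote an SQL identifier so any table or column name can be embedded safely."""
    return '"' + name.replace('"', '""') + '"'


def profile_table(conn: sqlite3.Connection, table: str) -> dict:
    """Hypothetical generic profile: row count, duplicate rows, per-column statistics."""
    t = quote_ident(table)
    # Discover the columns at run time (SQLite-specific catalogue query, an assumption).
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({t})")]
    total = conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
    # Fully duplicated rows = total rows minus distinct rows.
    distinct_rows = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT * FROM {t})"
    ).fetchone()[0]
    report = {"rows": total, "duplicate_rows": total - distinct_rows, "columns": {}}
    for col in columns:
        c = quote_ident(col)
        nulls, distinct, lowest, highest = conn.execute(
            f"SELECT SUM({c} IS NULL), COUNT(DISTINCT {c}), MIN({c}), MAX({c}) FROM {t}"
        ).fetchone()
        report["columns"][col] = {
            "nulls": nulls or 0,  # SUM returns NULL on an empty table
            "distinct": distinct,
            "min": lowest,
            "max": highest,
        }
    return report


def count_invalid_codes(
    conn: sqlite3.Connection,
    table: str,
    column: str,
    lookup_table: str,
    lookup_column: str,
) -> int:
    """Count non-null values in `column` that have no match in the lookup table."""
    t, c = quote_ident(table), quote_ident(column)
    lt, lc = quote_ident(lookup_table), quote_ident(lookup_column)
    sql = (
        f"SELECT COUNT(*) FROM {t} AS d "
        f"WHERE d.{c} IS NOT NULL "
        f"AND NOT EXISTS (SELECT 1 FROM {lt} AS l WHERE l.{lc} = d.{c})"
    )
    return conn.execute(sql).fetchone()[0]
```

Calling `profile_table(conn, "some_table")` on any open connection returns a nested dictionary that a reporting layer could render; the lookup check would only be run where metadata links a column to a lookup table, as the abstract describes. For very large tables the per-column queries would likely need to be combined or sampled, which this sketch does not attempt.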