Investigation and reporting of Data Quality within and between linked SAIL datasets

ABSTRACT Objectives The SAIL databank brings together a range of datasets gathered primarily for administrative rather than research processes. These datasets contain information regarding different aspects of an individual’s contact with services which when combined form a detailed health record f...

Bibliographic Details
Main Authors: Sarah Rees, Arfon Rees
Format: Article
Language: English
Published: Swansea University 2017-04-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/99
author Sarah Rees
Arfon Rees
collection DOAJ
description ABSTRACT
Objectives: The SAIL databank brings together a range of datasets gathered primarily for administrative rather than research purposes. These datasets contain information about different aspects of an individual's contact with services which, when combined, form a detailed health record for individuals living (or deceased) in Wales. Understanding the quality of data in SAIL supports the research process in three ways: it provides a level of assurance about the robustness of the data, it identifies and describes potential sources of bias due to invalid, incomplete, inconsistent or inaccurate data, and it thereby helps to increase the accuracy of research using these data. Designing processes to investigate and report on data quality within and between multiple datasets can be time-consuming; it requires considerable effort to ensure the output is genuinely meaningful and useful to SAIL users, and may require a range of different approaches.
Approach: Data quality tests were written for each dataset, covering a range of data quality dimensions including validity, consistency, accuracy and completeness. Tests were designed to capture not just the quality of data within each dataset, but also the consistency of data items between datasets. SQL scripts were written to test each of these aspects; to minimise repetition, automated processes were implemented where appropriate. Batch automation was used to call SQL stored procedures, which use metadata to generate dynamic SQL. The metadata (created as part of the data quality process) describes each dataset and the measurement parameters used to assess each field within the dataset. However, automation on its own is insufficient: data quality process outputs require scrutiny and oversight to ensure they capture what they are intended to capture. SAIL users were consulted on the development of the data quality reports to ensure the reports are usable and appropriate for supporting research use of the data.
Results: The data quality reporting process benefits the SAIL databank by providing additional information to support the research process. In some cases it may also act as a diagnostic tool, detecting problems with data which can then be rectified.
Conclusion: The development of data quality processes in SAIL is ongoing, and changes or developments in each dataset lead to new requirements for data quality measurement and reporting. A vital component of the process is the production of output that is genuinely meaningful and useful.
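The metadata-driven approach described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract describes SQL stored procedures called by batch automation, whereas the sketch below uses Python with an in-memory SQLite database as a stand-in. The table name `admissions`, the field names, the list of valid codes and the function names are all hypothetical, chosen only to show how one metadata entry per field can generate a completeness and validity query.

```python
import sqlite3

# Hypothetical metadata: one entry per field to assess.
# "valid_values" is an assumed measurement parameter; None means
# only completeness (missing values) is measured for that field.
FIELD_METADATA = [
    {"table": "admissions", "field": "sex_code", "valid_values": ("1", "2")},
    {"table": "admissions", "field": "admit_date", "valid_values": None},
]


def build_quality_sql(meta):
    """Generate a SQL statement (and its parameters) for one field."""
    table, field = meta["table"], meta["field"]
    # Table and field names come from curated metadata, not user input,
    # so they are interpolated; the valid values are bound as parameters.
    if meta["valid_values"]:
        marks = ", ".join("?" for _ in meta["valid_values"])
        invalid_expr = (
            f"SUM(CASE WHEN {field} IS NOT NULL "
            f"AND {field} NOT IN ({marks}) THEN 1 ELSE 0 END)"
        )
        params = tuple(meta["valid_values"])
    else:
        invalid_expr = "0"
        params = ()
    sql = (
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {field} IS NULL THEN 1 ELSE 0 END), "
        f"{invalid_expr} FROM {table}"
    )
    return sql, params


def run_quality_checks(conn, metadata):
    """Run the generated query for every metadata entry."""
    results = []
    for meta in metadata:
        sql, params = build_quality_sql(meta)
        total, missing, invalid = conn.execute(sql, params).fetchone()
        results.append({
            "table": meta["table"],
            "field": meta["field"],
            "total": total,
            "missing": missing or 0,  # SUM is NULL on an empty table
            "invalid": invalid or 0,
        })
    return results


def demo():
    """Build a tiny example table and report on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE admissions "
        "(person_id INTEGER, sex_code TEXT, admit_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO admissions VALUES (?, ?, ?)",
        [(1, "1", "2015-01-03"), (2, "9", None), (3, None, "2015-02-10")],
    )
    return run_quality_checks(conn, FIELD_METADATA)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Adding a new field to the report then only requires a new metadata entry rather than a new hand-written query, which is the kind of repetition the abstract says automation was meant to reduce. As the abstract also notes, the generated output would still need manual scrutiny.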
first_indexed 2024-03-09T07:50:54Z
format Article
id doaj.art-9eb7d590d9c44ab594d753efb0590e34
institution Directory Open Access Journal
issn 2399-4908
language English
last_indexed 2024-03-09T07:50:54Z
publishDate 2017-04-01
publisher Swansea University
record_format Article
series International Journal of Population Data Science
spelling doaj.art-9eb7d590d9c44ab594d753efb0590e34; 2023-12-03T01:44:23Z; eng; Swansea University; International Journal of Population Data Science; 2399-4908; 2017-04-01; 1; 1; 10.23889/ijpds.v1i1.99; 99; Investigation and reporting of Data Quality within and between linked SAIL datasets; Sarah Rees (Swansea University); Arfon Rees (Swansea University); https://ijpds.org/article/view/99
title Investigation and reporting of Data Quality within and between linked SAIL datasets
url https://ijpds.org/article/view/99