Investigation and reporting of Data Quality within and between linked SAIL datasets

ABSTRACT Objectives The SAIL databank brings together a range of datasets gathered primarily for administrative rather than research processes. These datasets contain information regarding different aspects of an individual’s contact with services which when combined form a detailed health record f...


Bibliographic Details
Main Authors:	Sarah Rees, Arfon Rees
Format:	Article
Language:	English
Published:	Swansea University 2017-04-01
Series:	International Journal of Population Data Science
Online Access:	https://ijpds.org/article/view/99

_version_	1827610490748010496
author	Sarah Rees, Arfon Rees
author_sort	Sarah Rees
collection	DOAJ
description	ABSTRACT Objectives: The SAIL databank brings together a range of datasets gathered primarily for administrative rather than research processes. These datasets contain information regarding different aspects of an individual's contact with services which, when combined, form a detailed health record for individuals living (or deceased) in Wales. Understanding the quality of data in SAIL supports the research process by providing a level of assurance about the robustness of data, identifying and describing where there may be sources of potential bias due to invalid, incomplete, inconsistent or inaccurate data, and therefore helping to increase the accuracy of research using these data. Designing processes to investigate and report on data quality within and between multiple datasets can be a time-consuming task; it requires a high degree of effort to ensure it is genuinely meaningful and useful to SAIL users, and may require a range of different approaches. Approach: Data quality tests for each dataset were written, considering a range of data quality dimensions including validity, consistency, accuracy and completeness. Tests were designed to capture not just the quality of data within each dataset, but also to assess consistency of data items between datasets. SQL scripts were written to test each of these aspects; in order to minimise repetition, automated processes were implemented where appropriate. Batch automation was used to call SQL stored procedures, which utilise metadata to generate dynamic SQL. The metadata (created as part of the data quality process) describes each dataset and the measurement parameters used to assess each field within the dataset. However, automation on its own is insufficient, and data quality process outputs require scrutiny and oversight to ensure they are actually capturing what they set out to capture. SAIL users were consulted on the development of the data quality reports to ensure usability and appropriateness to support data utilisation for research. Results: The data quality reporting process is beneficial to the SAIL databank as it provides additional information to support the research process and in some cases may act as a diagnostic tool, detecting problems with data which can then be rectified. Conclusion: The development of data quality processes in SAIL is ongoing, and changes or developments in each dataset lead to new requirements for data quality measurement and reporting. A vital component of the process is the production of output that is genuinely meaningful and useful.
first_indexed	2024-03-09T07:50:54Z
format	Article
id	doaj.art-9eb7d590d9c44ab594d753efb0590e34
institution	Directory of Open Access Journals
issn	2399-4908
language	English
last_indexed	2024-03-09T07:50:54Z
publishDate	2017-04-01
publisher	Swansea University
record_format	Article
series	International Journal of Population Data Science
spelling	doaj.art-9eb7d590d9c44ab594d753efb0590e34; 2023-12-03T01:44:23Z; eng; Swansea University; International Journal of Population Data Science; 2399-4908; 2017-04-01; 1; 1; 10.23889/ijpds.v1i1.99; 99; Investigation and reporting of Data Quality within and between linked SAIL datasets; Sarah Rees (0); Arfon Rees (1); Swansea University; Swansea University; https://ijpds.org/article/view/99
title	Investigation and reporting of Data Quality within and between linked SAIL datasets
url	https://ijpds.org/article/view/99
work_keys_str_mv	AT sarahrees investigationandreportingofdataqualitywithinandbetweenlinkedsaildatasets AT arfonrees investigationandreportingofdataqualitywithinandbetweenlinkedsaildatasets
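
The abstract describes stored procedures that read metadata and generate dynamic SQL for each data quality test. The record gives no implementation detail, so the following is only a minimal sketch of that general idea, written in Python with the standard-library sqlite3 module rather than the authors' stored procedures. Every table name, field name, rule format and function name here (admissions, demographics, METADATA, build_check_sql, run_checks, make_demo_connection) is a hypothetical chosen for illustration, and the three rule types cover completeness, validity and between-dataset consistency only.

```python
import sqlite3

# Hypothetical metadata: one entry per field and test. In the process the
# abstract describes, this would be held in database tables.
METADATA = [
    {"table": "admissions", "field": "sex", "check": "validity",
     "allowed": ("1", "2")},
    {"table": "admissions", "field": "admit_date", "check": "completeness"},
    {"table": "admissions", "field": "sex", "check": "consistency",
     "other_table": "demographics", "key": "person_id"},
]


def build_check_sql(rule):
    """Return SQL yielding (rows_tested, rows_failed) for one metadata rule.

    Identifiers are interpolated directly, so the metadata must be trusted.
    """
    table, field = rule["table"], rule["field"]
    if rule["check"] == "completeness":
        source = table
        failing = f"{field} IS NULL OR TRIM({field}) = ''"
    elif rule["check"] == "validity":
        allowed = ", ".join(f"'{value}'" for value in rule["allowed"])
        source = table
        failing = f"{field} IS NOT NULL AND {field} NOT IN ({allowed})"
    elif rule["check"] == "consistency":
        # Compare the same field across two tables joined on a shared key.
        other, key = rule["other_table"], rule["key"]
        source = f"{table} AS a JOIN {other} AS b ON a.{key} = b.{key}"
        failing = f"a.{field} <> b.{field}"
    else:
        raise ValueError(f"unknown check: {rule['check']}")
    return (
        "SELECT COUNT(*), "
        f"COALESCE(SUM(CASE WHEN {failing} THEN 1 ELSE 0 END), 0) "
        f"FROM {source}"
    )


def run_checks(conn, metadata):
    """Run every generated statement and collect one result row per rule."""
    results = []
    for rule in metadata:
        tested, failed = conn.execute(build_check_sql(rule)).fetchone()
        results.append({"table": rule["table"], "field": rule["field"],
                        "check": rule["check"], "tested": tested,
                        "failed": failed})
    return results


def make_demo_connection():
    """Build an in-memory database holding a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demographics (person_id INTEGER, sex TEXT)")
    conn.execute(
        "CREATE TABLE admissions (person_id INTEGER, sex TEXT, admit_date TEXT)"
    )
    conn.executemany("INSERT INTO demographics VALUES (?, ?)",
                     [(1, "1"), (2, "2"), (3, "1"), (4, "1")])
    conn.executemany("INSERT INTO admissions VALUES (?, ?, ?)",
                     [(1, "1", "2016-01-01"), (2, "2", None),
                      (3, "9", "2016-02-03"), (4, "1", "")])
    return conn


if __name__ == "__main__":
    for row in run_checks(make_demo_connection(), METADATA):
        print(row)
```

Keeping the rules as data means a new field or dataset needs only a new metadata entry, not a new script. As the abstract cautions, the generated counts would still need manual scrutiny to confirm that each test measures what it is meant to measure.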
