Investigation and reporting of Data Quality within and between linked SAIL datasets
ABSTRACT Objectives The SAIL databank brings together a range of datasets gathered primarily for administrative rather than research processes. These datasets contain information regarding different aspects of an individual’s contact with services which when combined form a detailed health record f...
Main Authors: | Sarah Rees, Arfon Rees |
---|---|
Format: | Article |
Language: | English |
Published: | Swansea University 2017-04-01 |
Series: | International Journal of Population Data Science |
Online Access: | https://ijpds.org/article/view/99 |
_version_ | 1827610490748010496 |
---|---|
author | Sarah Rees Arfon Rees |
author_facet | Sarah Rees Arfon Rees |
author_sort | Sarah Rees |
collection | DOAJ |
description | ABSTRACT
Objectives
The SAIL databank brings together a range of datasets gathered primarily for administrative rather than research purposes. These datasets contain information on different aspects of an individual's contact with services which, when combined, form a detailed health record for individuals living (or deceased) in Wales.
Understanding the quality of data in SAIL supports the research process in two ways: it provides a level of assurance about the robustness of the data, and it identifies and describes potential sources of bias arising from invalid, incomplete, inconsistent or inaccurate data. Both help to increase the accuracy of research using these data.
Designing processes to investigate and report on data quality within and between multiple datasets can be time-consuming; it takes considerable effort to ensure the output is genuinely meaningful and useful to SAIL users, and it may require a range of different approaches.
Approach
Data quality tests were written for each dataset, covering a range of data quality dimensions including validity, consistency, accuracy and completeness. Tests were designed to capture not only the quality of data within each dataset but also the consistency of data items between datasets. SQL scripts were written to test each of these aspects; to minimise repetition, automated processes were implemented where appropriate.
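The kinds of test described above can be illustrated with a minimal sketch. It is not the SAIL implementation: the table names (`gp_registration`, `hospital_admission`), the `sex` field, its allowed codes and the sample rows are all hypothetical, and Python's built-in `sqlite3` module stands in for the production database. It shows one completeness test and one validity test within a dataset, plus one consistency test between two datasets.

```python
import sqlite3


def build_demo_db():
    """Create two small, entirely hypothetical datasets in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gp_registration (person_id INTEGER, sex TEXT);
        CREATE TABLE hospital_admission (person_id INTEGER, sex TEXT);
        INSERT INTO gp_registration VALUES (1, 'F'), (2, 'M'), (3, NULL), (4, 'X');
        INSERT INTO hospital_admission VALUES (1, 'F'), (2, 'F'), (4, 'M');
        """
    )
    return conn


def completeness(conn):
    """Proportion of gp_registration rows where sex is populated."""
    total, filled = conn.execute(
        "SELECT COUNT(*), COUNT(sex) FROM gp_registration"
    ).fetchone()
    return filled / total


def validity(conn):
    """Among populated values, proportion that use an allowed code.

    The allowed codes ('F', 'M') are an assumption for illustration.
    """
    valid, filled = conn.execute(
        "SELECT SUM(sex IN ('F', 'M')), COUNT(sex) FROM gp_registration"
    ).fetchone()
    return valid / filled


def inconsistent_people(conn):
    """People whose recorded sex differs between the two datasets."""
    rows = conn.execute(
        """
        SELECT g.person_id
        FROM gp_registration AS g
        JOIN hospital_admission AS h ON h.person_id = g.person_id
        WHERE g.sex IS NOT NULL AND h.sex IS NOT NULL AND g.sex <> h.sex
        ORDER BY g.person_id
        """
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print("completeness:", completeness(conn))
    print("validity:", validity(conn))
    print("inconsistent person_ids:", inconsistent_people(conn))
```

The between-dataset test only compares people who appear in both tables with a populated value in each, so missing data is reported by the completeness test rather than counted again as an inconsistency.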
Batch automation was used to call SQL stored procedures, which use metadata to generate dynamic SQL. The metadata (created as part of the data quality process) describes each dataset and the measurement parameters used to assess each field within the dataset. However, automation on its own is insufficient: data quality process outputs require scrutiny and oversight to ensure they actually capture what they are intended to capture.
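A rough sketch of the metadata-driven idea follows. The abstract describes SQL stored procedures called by batch automation; here a Python loop over a metadata table plays that role, which is a substitution made purely for illustration. The `dq_metadata` table, its columns, the test names (`not_null`, `max_value`) and the `admissions` data are all hypothetical.

```python
import re
import sqlite3

# Identifiers cannot be bound as SQL parameters, so they are checked
# against a conservative pattern before being placed into generated SQL.
IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_check_sql(table, field, test, param):
    """Generate a query that counts rows failing one metadata-defined test."""
    for name in (table, field):
        if not IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if test == "not_null":
        condition = f"{field} IS NULL"
    elif test == "max_value":
        # float() rejects anything that is not a plain number.
        condition = f"{field} > {float(param)}"
    else:
        raise ValueError(f"unknown test type: {test!r}")
    return f"SELECT COUNT(*) FROM {table} WHERE {condition}"


def run_batch(conn):
    """Run every test listed in dq_metadata and return failure counts."""
    metadata = conn.execute(
        "SELECT table_name, field_name, test, param FROM dq_metadata"
    ).fetchall()
    results = {}
    for table, field, test, param in metadata:
        sql = build_check_sql(table, field, test, param)
        results[(table, field, test)] = conn.execute(sql).fetchone()[0]
    return results


def build_demo_db():
    """Hypothetical dataset plus the metadata that describes its tests."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE admissions (person_id INTEGER, age INTEGER);
        INSERT INTO admissions VALUES (1, 34), (2, NULL), (3, 140), (4, 200);
        CREATE TABLE dq_metadata (
            table_name TEXT, field_name TEXT, test TEXT, param TEXT
        );
        INSERT INTO dq_metadata VALUES
            ('admissions', 'age', 'not_null', NULL),
            ('admissions', 'age', 'max_value', '120');
        """
    )
    return conn


if __name__ == "__main__":
    for key, failures in run_batch(build_demo_db()).items():
        print(key, "failures:", failures)
```

Adding a test for another field then means adding a metadata row rather than writing a new script, which is the repetition the automated process is meant to avoid. The generated counts still need human review, as the abstract notes.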
SAIL users were consulted on the development of the data quality reports to ensure usability and appropriateness to support data utilisation for research.
Results
The data quality reporting process is beneficial to the SAIL databank as it provides additional information to support the research process and in some cases may act as a diagnostic tool, detecting problems with data which can then be rectified.
Conclusion
The development of data quality processes in SAIL is ongoing, and changes or developments in each dataset lead to new requirements for data quality measurement and reporting. A vital component of the process is the production of output that is genuinely meaningful and useful. |
first_indexed | 2024-03-09T07:50:54Z |
format | Article |
id | doaj.art-9eb7d590d9c44ab594d753efb0590e34 |
institution | Directory Open Access Journal |
issn | 2399-4908 |
language | English |
last_indexed | 2024-03-09T07:50:54Z |
publishDate | 2017-04-01 |
publisher | Swansea University |
record_format | Article |
series | International Journal of Population Data Science |
spelling | doaj.art-9eb7d590d9c44ab594d753efb0590e34 2023-12-03T01:44:23Z eng Swansea University International Journal of Population Data Science 2399-4908 2017-04-01 1 1 10.23889/ijpds.v1i1.99 99 Investigation and reporting of Data Quality within and between linked SAIL datasets Sarah Rees, Arfon Rees (Swansea University) https://ijpds.org/article/view/99 |
spellingShingle | Sarah Rees Arfon Rees Investigation and reporting of Data Quality within and between linked SAIL datasets International Journal of Population Data Science |
title | Investigation and reporting of Data Quality within and between linked SAIL datasets |
title_sort | investigation and reporting of data quality within and between linked sail datasets |
url | https://ijpds.org/article/view/99 |
work_keys_str_mv | AT sarahrees investigationandreportingofdataqualitywithinandbetweenlinkedsaildatasets AT arfonrees investigationandreportingofdataqualitywithinandbetweenlinkedsaildatasets |