Data provenance tracking and reporting in a high-security digital research environment.

Objective: To protect privacy, routinely-collected data are processed and anonymised by third parties before being used for research. However, the methods used to do this are rarely shared, leaving the resulting research difficult to evaluate and liable to undetected errors. Here, we present a proven...

Bibliographic Details
Main Authors:	Bernhard Scheliga, Milan Markovic, Helen Rowlands, Artur Wozniak, Katie Wilde, Jessica Butler
Format:	Article
Language:	English
Published:	Swansea University 2022-08-01
Series:	International Journal of Population Data Science
Subjects:	Improving data and linkage quality; Software development; Developing and improving data services; Data provenance; Knowledge graphs
Online Access:	https://ijpds.org/article/view/1909

_version_	1797422938481229824
author	Bernhard Scheliga, Milan Markovic, Helen Rowlands, Artur Wozniak, Katie Wilde, Jessica Butler
collection	DOAJ
description	Objective: To protect privacy, routinely-collected data are processed and anonymised by third parties before being used for research. However, the methods used to do this are rarely shared, leaving the resulting research difficult to evaluate and liable to undetected errors. Here, we present a provenance-based approach for documenting and auditing such methods. Approach: We designed the Safe Haven Provenance (SHP) ontology for representing provenance information about data, users, and activities within high-security environments as knowledge graphs. The work was based on a case study of the Grampian Data Safe Haven (DASH) which holds and processes medical records for 600,000 people in Scotland. The SHP ontology was designed as an extension to the standard W3C PROV-O ontology. The auditing capabilities of our approach were evaluated against a set of transparency requirements through a prototype interactive dashboard. Results: We demonstrated the ability of the SHP ontology to document the workflow within DASH: capturing the extraction and anonymisation process using a structured vocabulary of entities (e.g. datasets), activities (e.g. linkage, anonymisation) and agents (e.g. analysts, data owners). Two provenance reporting templates were designed following interviews with DASH staff and clinical researchers: 1) a detailed report for use within DASH for quality assurance, and 2) a summary report for researchers that was safe for public release. Using a prototype data-linkage project, we formalised queries for report generation, and demonstrated use of automated rules for error detection (e.g., data discrepancies) using the structure of the SHP knowledge graphs. All of the project outputs are available under an open-source license. Conclusions: This project lays a foundation for more transparent high-quality research using public data for health care and innovation. The SHP ontology is extendible for different domains and potentially represents a key component for further automation of provenance capture and reporting in high-security research environments.
first_indexed	2024-03-09T07:39:16Z
format	Article
id	doaj.art-615e2bab2a1c40e9b0a817e583b5ed6b
institution	Directory Open Access Journal
issn	2399-4908
language	English
last_indexed	2024-03-09T07:39:16Z
publishDate	2022-08-01
publisher	Swansea University
record_format	Article
series	International Journal of Population Data Science
spelling	doaj.art-615e2bab2a1c40e9b0a817e583b5ed6b; 2023-12-03T04:59:26Z; eng; Swansea University; International Journal of Population Data Science; 2399-4908; 2022-08-01; volume 7, issue 3; 10.23889/ijpds.v7i3.1909; Data provenance tracking and reporting in a high-security digital research environment.; Bernhard Scheliga, Milan Markovic, Helen Rowlands, Artur Wozniak, Katie Wilde, Jessica Butler (all University of Aberdeen); https://ijpds.org/article/view/1909
title	Data provenance tracking and reporting in a high-security digital research environment.
topic	Improving data and linkage quality; Software development; Developing and improving data services; Data provenance; Knowledge graphs
url	https://ijpds.org/article/view/1909
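
Illustrative note: the abstract describes provenance as a knowledge graph of entities, activities and agents, with queries for reporting and automated rules for error detection. The following is a minimal sketch of that idea, not the SHP ontology itself. Only the `prov:` terms are taken from W3C PROV-O; every `ex:` name, the `ex:rowCount` property, the row counts and the discrepancy rule are hypothetical and chosen purely for illustration.

```python
# Hypothetical sketch: a provenance graph as (subject, predicate, object) triples.
# The prov: terms exist in W3C PROV-O; all ex: names and values are assumptions.
TRIPLES = [
    ("ex:rawRecords", "rdf:type", "prov:Entity"),
    ("ex:linkedRecords", "rdf:type", "prov:Entity"),
    ("ex:anonymisedExtract", "rdf:type", "prov:Entity"),
    ("ex:linkage", "rdf:type", "prov:Activity"),
    ("ex:anonymisation", "rdf:type", "prov:Activity"),
    ("ex:analyst", "rdf:type", "prov:Agent"),
    ("ex:linkage", "prov:used", "ex:rawRecords"),
    ("ex:linkedRecords", "prov:wasGeneratedBy", "ex:linkage"),
    ("ex:anonymisation", "prov:used", "ex:linkedRecords"),
    ("ex:anonymisedExtract", "prov:wasGeneratedBy", "ex:anonymisation"),
    ("ex:linkage", "prov:wasAssociatedWith", "ex:analyst"),
    ("ex:anonymisation", "prov:wasAssociatedWith", "ex:analyst"),
    ("ex:rawRecords", "ex:rowCount", 1000),
    ("ex:linkedRecords", "ex:rowCount", 1000),
    ("ex:anonymisedExtract", "ex:rowCount", 990),
]


def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(triples, predicate, obj):
    """Return every subject s such that (s, predicate, obj) is in the graph."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def lineage(triples, entity):
    """Walk backwards from an entity, listing (activity, input entity) steps."""
    steps = []
    for activity in objects(triples, entity, "prov:wasGeneratedBy"):
        for source in objects(triples, activity, "prov:used"):
            steps.append((activity, source))
            steps.extend(lineage(triples, source))
    return steps


def row_count_discrepancies(triples):
    """Hypothetical rule: flag activities whose output row count differs from an input's."""
    findings = []
    for activity in subjects(triples, "rdf:type", "prov:Activity"):
        for source in objects(triples, activity, "prov:used"):
            for output in subjects(triples, "prov:wasGeneratedBy", activity):
                n_in = objects(triples, source, "ex:rowCount")
                n_out = objects(triples, output, "ex:rowCount")
                if n_in and n_out and n_in[0] != n_out[0]:
                    findings.append((activity, source, output, n_in[0], n_out[0]))
    return findings


if __name__ == "__main__":
    for step in lineage(TRIPLES, "ex:anonymisedExtract"):
        print("derived via", step)
    for finding in row_count_discrepancies(TRIPLES):
        print("discrepancy:", finding)
```

In this sketch the lineage walk stands in for a report-generation query and the row-count check stands in for an error-detection rule; a real deployment would more plausibly store the graph in an RDF store and express both as queries over it.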

Similar Items