Data provenance tracking and reporting in a high-security digital research environment.


Bibliographic Details
Main Authors: Bernhard Scheliga, Milan Markovic, Helen Rowlands, Artur Wozniak, Katie Wilde, Jessica Butler
Format: Article
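Language: English
Published: Swansea University 2022-08-01
Series: International Journal of Population Data Science
Subjects: Improving data and linkage quality; Software development; Developing and improving data services; Data provenance; Knowledge graphs
Online Access: https://ijpds.org/article/view/1909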
collection DOAJ
description
Objective: To protect privacy, routinely collected data are processed and anonymised by third parties before being used for research. However, the methods used to do this are rarely shared, leaving the resulting research difficult to evaluate and liable to undetected errors. Here, we present a provenance-based approach for documenting and auditing such methods.
Approach: We designed the Safe Haven Provenance (SHP) ontology for representing provenance information about data, users, and activities within high-security environments as knowledge graphs. The work was based on a case study of the Grampian Data Safe Haven (DASH), which holds and processes medical records for 600,000 people in Scotland. The SHP ontology was designed as an extension to the standard W3C PROV-O ontology. The auditing capabilities of our approach were evaluated against a set of transparency requirements through a prototype interactive dashboard.
Results: We demonstrated the ability of the SHP ontology to document the workflow within DASH, capturing the extraction and anonymisation process using a structured vocabulary of entities (e.g. datasets), activities (e.g. linkage, anonymisation) and agents (e.g. analysts, data owners). Two provenance reporting templates were designed following interviews with DASH staff and clinical researchers: 1) a detailed report for use within DASH for quality assurance, and 2) a summary report for researchers that was safe for public release. Using a prototype data-linkage project, we formalised queries for report generation and demonstrated the use of automated rules for error detection (e.g. data discrepancies) using the structure of the SHP knowledge graphs. All of the project outputs are available under an open-source license.
Conclusions: This project lays a foundation for more transparent, high-quality research using public data for health care and innovation. The SHP ontology is extensible to different domains and potentially represents a key component for further automation of provenance capture and reporting in high-security research environments.
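The entity/activity/agent vocabulary and the rule-based error detection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use any SHP-specific terms, which are not given in this record. It uses only three core W3C PROV-O properties (prov:used, prov:wasGeneratedBy, prov:wasAssociatedWith). All identifiers, the row_count attribute, and the row-count rule itself are hypothetical assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A dataset. row_count is a hypothetical attribute used by the rule below."""
    id: str
    row_count: int


@dataclass(frozen=True)
class Agent:
    """A person or organisation responsible for an activity."""
    id: str
    role: str


@dataclass
class Activity:
    """A processing step such as linkage or anonymisation."""
    id: str
    kind: str
    used: list = field(default_factory=list)
    generated: list = field(default_factory=list)
    agents: list = field(default_factory=list)


def to_triples(activities):
    """Flatten activities into (subject, predicate, object) triples
    using core PROV-O property names."""
    triples = []
    for act in activities:
        for ent in act.used:
            triples.append((act.id, "prov:used", ent.id))
        for ent in act.generated:
            triples.append((ent.id, "prov:wasGeneratedBy", act.id))
        for agent in act.agents:
            triples.append((act.id, "prov:wasAssociatedWith", agent.id))
    return triples


def find_row_count_discrepancies(activities, kind="anonymisation"):
    """Assumed example rule: an activity of the given kind should not
    change the total number of rows between its inputs and outputs."""
    issues = []
    for act in activities:
        if act.kind != kind:
            continue
        rows_in = sum(e.row_count for e in act.used)
        rows_out = sum(e.row_count for e in act.generated)
        if rows_in != rows_out:
            issues.append((act.id, rows_in, rows_out))
    return issues


def example_workflow():
    """Hypothetical two-step workflow: linkage, then anonymisation."""
    analyst = Agent("agent:analyst1", "analyst")
    raw = Entity("entity:raw_extract", 1000)
    linked = Entity("entity:linked", 1000)
    anonymised = Entity("entity:anonymised", 990)
    return [
        Activity("activity:linkage", "linkage", [raw], [linked], [analyst]),
        Activity("activity:anonymisation", "anonymisation",
                 [linked], [anonymised], [analyst]),
    ]
```

Calling to_triples(example_workflow()) gives a small edge list that could be loaded into a graph store, and find_row_count_discrepancies(example_workflow()) flags the anonymisation step because its output has fewer rows than its input. A production system would more plausibly express such rules as queries over an RDF graph. Plain Python is used here only to keep the sketch self-contained.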
id doaj.art-615e2bab2a1c40e9b0a817e583b5ed6b
institution Directory Open Access Journal
issn 2399-4908
spelling doaj.art-615e2bab2a1c40e9b0a817e583b5ed6b
Record timestamp: 2023-12-03T04:59:26Z
Volume 7, Issue 3
DOI: 10.23889/ijpds.v7i3.1909
Author affiliations: Bernhard Scheliga, Milan Markovic, Helen Rowlands, Artur Wozniak, Katie Wilde and Jessica Butler are all affiliated with the University of Aberdeen.
topic Improving data and linkage quality
Software development
Developing and improving data services
Data provenance
Knowledge graphs