A process-driven platform to manage datasets for research


Bibliographic Details
Main Author: Gordon McAllister
Format: Article
Language: English
Published: Swansea University 2017-04-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/292
author Gordon McAllister
collection DOAJ
description ABSTRACT

Objectives
• Accumulate, manage and control shared access to research data;
• Transform and maintain transformation state information about research data;
• Analyse and investigate data in related sets using open and bespoke tools;
• Publish extracted data to a secure safe haven environment.

Approach
The Research Data Management Platform (RDMP) is a set of data structures and processes, sharing a core Catalogue, to manage electronic health records, genomic data and imaging data throughout their lifecycle, from identification and acquisition to safe disposal or archival and retention in secured Safe Havens (SH). The architecture of the RDMP consists of the Catalogue and five internal processes: Data Load, Catalogue Management, Data Quality, Data Summary, and Data Extraction. These are designed to enforce rigorous information governance standards relevant to the processing and anonymisation of personally identifiable data. The Catalogue serves as the single ‘source of truth’ about the datasets, which all RDMP processes consult. This facilitates repeatable, reliable and auditable operations on the data. The novelty of the RDMP is that it dynamically captures and preserves data transformation processes along with the primary research data, to promote reuse and curation of continuously accruing research data repositories in a secure SH environment. Thus, the RDMP brings transparency and reproducibility that benefit research programmes in a way that archival of static data objects does not.

Results
The RDMP has been in production use since 1 July 2014. There are 107 datasets configured in the Catalogue, with up to 67 dataset extractions for each of 48 research projects. It has provided data for 32 high-impact journal papers published in the last year.
Improvements in turnaround time:
• Research project data provision reduced from six months to two weeks;
• Data loading reduced from two days to a few hours;
• Research query response reduced from days to within a day, owing to an improved and standardised metadata catalogue.

Conclusion
The RDMP is a key component in automating the regular release of datasets and rationalising dataset changes over time to ensure reliable delivery of extracts to research projects. The tools and processes comprising the RDMP not only fulfil the research data management (RDM) requirements of researchers, but also support collaboration on data cleaning, data transformation, data summarisation and data quality assessment activities by different research groups.
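The idea in the Approach section of a single Catalogue that every process consults, and that keeps transformation history alongside the data, can be illustrated with a minimal sketch. This is not the RDMP's actual implementation or API: the class names, method names, dataset name, column names and sample row below are all hypothetical, chosen only to show one way a shared catalogue could make an extraction step repeatable and auditable.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetEntry:
    """Hypothetical catalogue record: metadata plus transformation history."""
    name: str
    columns: list[str]
    transformations: list[str] = field(default_factory=list)


@dataclass
class Catalogue:
    """Illustrative shared registry that every process consults."""
    entries: dict[str, DatasetEntry] = field(default_factory=dict)
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def register(self, entry: DatasetEntry) -> None:
        self.entries[entry.name] = entry
        self._audit("Catalogue Management", f"registered {entry.name}")

    def lookup(self, name: str, process: str) -> DatasetEntry:
        # Every read is logged, so operations on the data stay auditable.
        self._audit(process, f"consulted {name}")
        return self.entries[name]

    def record_transformation(self, name: str, step: str, process: str) -> None:
        # The transformation is stored with the dataset's own metadata.
        self.entries[name].transformations.append(step)
        self._audit(process, f"{name}: {step}")

    def _audit(self, process: str, message: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_log.append((stamp, process, message))


def extract(catalogue: Catalogue, name: str, rows: list[dict], drop_columns: set[str]) -> list[dict]:
    """Data Extraction sketch: release only the columns not marked for removal."""
    entry = catalogue.lookup(name, "Data Extraction")
    kept = [c for c in entry.columns if c not in drop_columns]
    catalogue.record_transformation(
        name, f"dropped {sorted(drop_columns)}", "Data Extraction"
    )
    return [{c: row[c] for c in kept} for row in rows]


# Made-up example data, for illustration only.
catalogue = Catalogue()
catalogue.register(DatasetEntry("prescribing", ["patient_id", "drug", "date"]))
rows = [{"patient_id": "A1", "drug": "metformin", "date": "2014-07-01"}]
released = extract(catalogue, "prescribing", rows, {"patient_id"})
```

Because the extraction function reads column definitions from the catalogue rather than from its own configuration, rerunning it against the same catalogue state would apply the same rules, and the audit log records which process touched which dataset.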
first_indexed 2024-03-09T09:24:07Z
format Article
id doaj.art-8062d99b793645b9af50956eaed3e79b
institution Directory Open Access Journal
issn 2399-4908
language English
last_indexed 2024-03-09T09:24:07Z
publishDate 2017-04-01
publisher Swansea University
record_format Article
series International Journal of Population Data Science
volume 1
issue 1
doi 10.23889/ijpds.v1i1.292
author_affiliation University of Dundee
title A process-driven platform to manage datasets for research
url https://ijpds.org/article/view/292