Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD.

Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a...

Bibliographic Details
Main Authors: Yannick Marcon, Tom Bishop, Demetris Avraam, Xavier Escriba-Montagut, Patricia Ryser-Welch, Stuart Wheater, Paul Burton, Juan R González
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2021-03-01
Series: PLoS Computational Biology
Online Access: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1008880&type=printable
author Yannick Marcon
Tom Bishop
Demetris Avraam
Xavier Escriba-Montagut
Patricia Ryser-Welch
Stuart Wheater
Paul Burton
Juan R González
collection DOAJ
description Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a collection of R packages that reduce the challenges of these methods. These include ethico-legal constraints, which limit researchers' ability to physically bring data together, and the analytical inflexibility associated with conventional approaches to sharing results. The key feature of DataSHIELD is that data from research studies stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access its data. The platform allows an analyst to pass commands to each server, and the analyst receives results that do not disclose the individual-level data of any study participants. DataSHIELD uses Opal, which is a data integration system used by epidemiological studies and developed by the OBiBa open-source project in the domain of bioinformatics. However, until now the analysis of big data with DataSHIELD has been limited by the storage formats available in Opal and the analysis capabilities available in the DataSHIELD R packages. We present a new architecture ("resources") for DataSHIELD and Opal to allow large, complex datasets to be used at their original location, in their original format and with external computing facilities. We provide some real big data analysis examples in genomics and geospatial projects. For genomic data analyses, we also illustrate how to extend the resources concept to address specific big data infrastructures such as GA4GH or EGA, and make use of shell commands. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects. To help researchers use this framework, we describe selected packages and present an online book (https://isglobal-brge.github.io/resource_bookdown).
first_indexed 2024-12-16T07:46:09Z
format Article
id doaj.art-29c1a98962674dd5be21f41b59c0eb6c
institution Directory Open Access Journal
issn 1553-734X
1553-7358
language English
last_indexed 2025-03-14T07:46:38Z
publishDate 2021-03-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS Computational Biology
spelling doaj.art-29c1a98962674dd5be21f41b59c0eb6c 2025-03-03T05:31:32Z eng Public Library of Science (PLoS) PLoS Computational Biology 1553-734X 1553-7358 2021-03-01 17 3 e1008880 10.1371/journal.pcbi.1008880
title Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD.
url https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1008880&type=printable