Privacy protected text analysis in DataSHIELD

ABSTRACT Objectives DataSHIELD (www.datashield.ac.uk) was born of the requirement in the biomedical and social sciences to co-analyse individual patient data (microdata) from different sources, without disclosing identity or sensitive information. Under DataSHIELD, raw data never leaves the data...

Bibliographic Details
Main Authors: Rebecca Wilson, Oliver Butters, Demetris Avraam, Andrew Turner, Paul Burton
Format: Article
Language: English
Published: Swansea University 2017-04-01
Series: International Journal of Population Data Science
Online Access: https://ijpds.org/article/view/289
collection DOAJ
description ABSTRACT
Objectives: DataSHIELD (www.datashield.ac.uk) was born of the requirement in the biomedical and social sciences to co-analyse individual patient data (microdata) from different sources without disclosing identity or sensitive information. Under DataSHIELD, raw data never leaves the data provider, and the researcher sees no microdata or disclosive information. The analysis is taken to the data, not the data to the analysis. Text data can be highly disclosive in the biomedical domain (patient records, GP letters, etc.). Related but distinct issues arise in other domains: text may be copyrighted or carry substantial intellectual property (IP) value, making sharing impractical.
Approach: By treating text analogously to individual patient data, we assessed whether DataSHIELD could be adapted and implemented for text analysis, circumventing the key obstacles that currently prevent such analysis.
Results: Using open digitised text data held by the British Library, we developed a DataSHIELD proof-of-concept infrastructure and prototype DataSHIELD functions for free text analysis.
Conclusions: Whilst it is possible to analyse free text within a DataSHIELD infrastructure, the challenge lies in creating generalised and resilient anti-disclosure methods for free text analysis. There is a range of biomedical and health sciences applications for DataSHIELD methods of privacy-protected free text analysis, including analysis of electronic health records and of qualitative data, for example from social media.
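The principle described in the abstract (analysis runs at the data provider, and only non-disclosive summaries reach the researcher) can be illustrated with a small sketch. This is not DataSHIELD code and does not use its API: the function names, the example texts, and the threshold of three documents are all assumptions made for illustration. Each site counts terms locally and releases only terms that appear in at least a minimum number of documents, so rare strings such as names stay on site.

```python
import re
from collections import Counter

MIN_DOCS = 3  # hypothetical disclosure threshold, not a DataSHIELD default


def site_term_counts(documents, min_docs=MIN_DOCS):
    """Run at the data provider: return only aggregate term counts.

    A term is released only if it occurs in at least `min_docs` documents,
    so rare strings (names, identifiers) never leave the site.
    """
    term_totals = Counter()  # total occurrences of each term
    doc_freq = Counter()     # number of documents containing each term
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        term_totals.update(tokens)
        doc_freq.update(set(tokens))
    return {t: n for t, n in term_totals.items() if doc_freq[t] >= min_docs}


def pool_counts(site_results):
    """Run by the researcher: combine the aggregates returned by each site."""
    pooled = Counter()
    for counts in site_results:
        pooled.update(counts)
    return dict(pooled)


# Invented example texts standing in for two data providers.
site_a = [
    "Patient reports chest pain",
    "Chest pain resolved",
    "No chest pain; seen by Dr Smith",
]
site_b = [
    "Chest x-ray clear",
    "Mild chest infection",
    "Chest pain on exertion",
]

pooled = pool_counts([site_term_counts(site_a), site_term_counts(site_b)])
print(pooled)
```

A simple frequency threshold of this kind is only one possible safeguard; as the conclusions note, generalised and resilient anti-disclosure methods for free text remain the open challenge.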
first_indexed 2024-03-09T09:28:04Z
format Article
id doaj.art-7670295ffe8d4c058d7d066e42b3dd05
institution Directory Open Access Journal
issn 2399-4908
language English
last_indexed 2024-03-09T09:28:04Z
spelling doaj.art-7670295ffe8d4c058d7d066e42b3dd05; 2023-12-02T05:24:48Z; eng; Swansea University; International Journal of Population Data Science; ISSN 2399-4908; 2017-04-01; volume 1, issue 1; DOI 10.23889/ijpds.v1i1.289; article 289; Privacy protected text analysis in DataSHIELD; Rebecca Wilson, Oliver Butters, Demetris Avraam, Andrew Turner, Paul Burton (all University of Bristol); https://ijpds.org/article/view/289