Using blockchain to log genome dataset access: efficient storage and query



Bibliographic Details
Main Authors: Gamze Gürsoy, Robert Bjornson, Molly E. Green, Mark Gerstein
Format: Article
Language: English
Published: BMC 2020-07-01
Series: BMC Medical Genomics
Subjects: Blockchain; Secure storage; Genomic data access log
Online Access: http://link.springer.com/article/10.1186/s12920-020-0716-z
collection DOAJ
description Abstract
Background: Genomic variants are considered sensitive information, revealing potentially private facts about individuals. Therefore, it is important to control access to such data. A key aspect of controlled access is the secure storage and efficient querying of access logs, so that potential misuse can be investigated. However, there are challenges to securing logs, such as designing against the consequences of “single points of failure”. A potential approach to circumvent these challenges is blockchain technology, which is currently popular in cryptocurrency due to its properties of security, immutability, and decentralization. One of the tasks of the iDASH (Integrating Data for Analysis, Anonymization, and Sharing) Secure Genome Analysis Competition in 2018 was to develop time- and space-efficient blockchain-based ledgering solutions to log and query user activity accessing genomic datasets across multiple sites, using MultiChain.
Methods: MultiChain is a specific blockchain platform that offers “data streams” embedded in the chain for rapid and secure data storage. We devised a storage protocol taking advantage of the keys in the MultiChain data streams and created a data frame from the chain allowing efficient query. Our solution to the iDASH competition was selected as the winner at a workshop held in San Diego, CA in October 2018. Although our solution worked well in the challenge, it has the drawback that it requires downloading all the data from the chain and keeping it locally in memory for fast query. To address this, we provide an alternate “bigmem” solution that uses indices rather than local storage for rapid queries.
Results: We profiled the performance of both of our solutions using logs with 100,000 to 600,000 entries, both for querying the chain and inserting data into it. The challenge solution requires 12 seconds and 120 MB of memory for querying from 100,000 entries. The memory requirement increases linearly and reaches 470 MB for a chain with 600,000 entries. Although our alternate bigmem solution is slower and requires more memory (408 seconds and 250 MB, respectively, for 100,000 entries), the memory requirement increases at a slower rate and reaches only 360 MB for 600,000 entries.
Conclusion: Overall, we demonstrate that genomic access log files can be stored and queried efficiently with blockchain. Beyond this, our protocol potentially could be applied to other types of health data such as electronic health records.
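The Methods summary above describes keying log entries so that queries can be answered quickly from a local copy of the data. The following is a minimal sketch of that general idea in Python, using plain dictionaries in place of MultiChain streams and a data frame. It is not the authors' implementation: the `LogIndex` class and the field names (`timestamp`, `user`, `resource`, `activity`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical log fields; the article's actual schema is not given in this record.
FIELDS = ("user", "resource", "activity")


class LogIndex:
    """In-memory index of access-log entries, keyed by field value."""

    def __init__(self):
        self.entries = []  # all entries, in insertion order
        # One inverted index per field: value -> set of positions in self.entries
        self.by_field = {f: defaultdict(set) for f in FIELDS}

    def insert(self, entry):
        """Store an entry and register it under each of its field values."""
        pos = len(self.entries)
        self.entries.append(entry)
        for f in FIELDS:
            self.by_field[f][entry[f]].add(pos)

    def query(self, start=None, end=None, **criteria):
        """Return entries matching all field criteria and the optional time range."""
        if criteria:
            sets = [self.by_field[f].get(v, set()) for f, v in criteria.items()]
            hits = set.intersection(*sets)
        else:
            hits = set(range(len(self.entries)))
        ordered = [self.entries[i] for i in sorted(hits)]
        return [e for e in ordered
                if (start is None or e["timestamp"] >= start)
                and (end is None or e["timestamp"] <= end)]


index = LogIndex()
index.insert({"timestamp": 1, "user": "u1", "resource": "r1", "activity": "view"})
index.insert({"timestamp": 2, "user": "u2", "resource": "r1", "activity": "download"})
index.insert({"timestamp": 3, "user": "u1", "resource": "r2", "activity": "download"})

# Entries where user u1 performed a download.
matches = index.query(user="u1", activity="download")
print(matches)
```

Holding the whole log in a structure like this reflects the trade-off noted in the abstract: lookups are fast, but memory use grows with the number of entries.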
first_indexed 2024-12-22T06:55:03Z
format Article
id doaj.art-2af95aa6869241849e5d14083569447f
institution Directory Open Access Journal
issn 1755-8794
language English
last_indexed 2024-12-22T06:55:03Z
publishDate 2020-07-01
publisher BMC
record_format Article
series BMC Medical Genomics
spelling doaj.art-2af95aa6869241849e5d14083569447f 2022-12-21T18:35:00Z eng BMC BMC Medical Genomics 1755-8794 2020-07-01 13S719 10.1186/s12920-020-0716-z
Using blockchain to log genome dataset access: efficient storage and query
Gamze Gürsoy (Program in Computational Biology and Bioinformatics, Yale University)
Robert Bjornson (Yale Center for Research Computing)
Molly E. Green (Program in Computational Biology and Bioinformatics, Yale University)
Mark Gerstein (Program in Computational Biology and Bioinformatics, Yale University)
http://link.springer.com/article/10.1186/s12920-020-0716-z
Blockchain
Secure storage
Genomic data access log
title Using blockchain to log genome dataset access: efficient storage and query
topic Blockchain
Secure storage
Genomic data access log
url http://link.springer.com/article/10.1186/s12920-020-0716-z