Efficient Versioning for Scientific Array Databases

In this paper, we describe a versioned database storage manager we are developing for the SciDB scientific database. The system is designed to efficiently store and retrieve array-oriented data, exposing a "no-overwrite" storage model in which each update creates a new "version"...

Bibliographic Details
Main Authors: Seering, Adam, Cudre-Mauroux, Philippe, Stonebraker, Michael, Madden, Samuel R.
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Article
Language: en_US
Published: Institute of Electrical and Electronics Engineers (IEEE) 2014
Online Access: http://hdl.handle.net/1721.1/90380
https://orcid.org/0000-0002-7470-3265
https://orcid.org/0000-0001-9184-9058
collection MIT
description In this paper, we describe a versioned database storage manager we are developing for the SciDB scientific database. The system is designed to efficiently store and retrieve array-oriented data, exposing a "no-overwrite" storage model in which each update creates a new "version" of an array. This makes it possible to perform comparisons of versions produced at different times or by different algorithms, and to create complex chains and trees of versions. We present algorithms to efficiently encode these versions, minimizing storage space while still providing efficient access to the data. Additionally, we present an optimal algorithm that, given a long sequence of versions, determines which versions to encode in terms of each other (using delta compression) to minimize total storage space or query execution cost. We compare the performance of these algorithms on real-world data sets from the National Oceanic and Atmospheric Administration (NOAA), Open Street Maps, and several other sources. We show that our algorithms provide better performance than existing version control systems not optimized for array data, both in terms of storage size and access time, and that our delta-compression algorithms are able to substantially reduce the total storage space when versions exist with a high degree of similarity.
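The "no-overwrite" model and delta compression mentioned in the description can be illustrated with a small sketch. This is not the paper's algorithm and not SciDB's API: the class `VersionedArray`, its methods, and the rule "store a delta only if it is smaller than a full copy" are all hypothetical choices made for illustration. Each commit appends a new version and never modifies an earlier one; a version is stored either in full or as the set of cells that differ from the previous version.

```python
# Hypothetical sketch: none of these names come from SciDB or from the paper.
from typing import List, Tuple


class VersionedArray:
    """Append-only store for versions of a flat, fixed-length array."""

    def __init__(self) -> None:
        # Each entry is ("full", list_of_values) or ("delta", {index: value}).
        self._versions: List[Tuple[str, object]] = []

    def commit(self, data: List[float]) -> int:
        """Store `data` as a new version and return its version number."""
        if not self._versions:
            self._versions.append(("full", list(data)))
            return 0
        previous = self.read(len(self._versions) - 1)
        if len(previous) != len(data):
            raise ValueError("all versions must have the same length")
        delta = {
            i: new
            for i, (old, new) in enumerate(zip(previous, data))
            if old != new
        }
        # Assumed cost model: a delta entry costs two slots (index + value).
        if 2 * len(delta) < len(data):
            self._versions.append(("delta", delta))
        else:
            self._versions.append(("full", list(data)))
        return len(self._versions) - 1

    def read(self, version: int) -> List[float]:
        """Rebuild a version by replaying deltas from the nearest full copy."""
        if not 0 <= version < len(self._versions):
            raise IndexError("unknown version")
        base = version
        while self._versions[base][0] != "full":
            base -= 1  # version 0 is always full, so this terminates
        result = list(self._versions[base][1])
        for _, payload in self._versions[base + 1 : version + 1]:
            for index, value in payload.items():
                result[index] = value
        return result

    def kind(self, version: int) -> str:
        """Return how a version is physically stored: 'full' or 'delta'."""
        return self._versions[version][0]


store = VersionedArray()
v0 = store.commit([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
v1 = store.commit([1.0, 2.0, 3.5, 4.0, 5.0, 6.0])  # one cell changed
v2 = store.commit([9.0, 8.0, 7.0, 6.0, 5.5, 4.0])  # every cell changed
v3 = store.commit([9.0, 8.0, 7.0, 6.0, 5.5, 0.0])  # one cell changed
```

In this sketch, reading a version costs more the further it sits from a full copy, while storing more full copies costs space. That is the storage-versus-access trade-off the description refers to; the sketch always deltas against the immediately preceding version and makes no attempt to choose the encoding optimally.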
id mit-1721.1/90380
institution Massachusetts Institute of Technology
spelling mit-1721.1/90380
2022-10-01T17:39:18Z
Efficient Versioning for Scientific Array Databases
Seering, Adam
Cudre-Mauroux, Philippe
Stonebraker, Michael
Madden, Samuel R.
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
National Science Foundation (U.S.) (Grant IIS/III-1111371)
National Science Foundation (U.S.) (Grant SI2-1047955)
2014-09-26T12:50:15Z
2012-04
Article
http://purl.org/eprint/type/ConferencePaper
978-0-7695-4747-3
978-1-4673-0042-1
1063-6382
http://hdl.handle.net/1721.1/90380
Seering, Adam, Philippe Cudre-Mauroux, Samuel Madden, and Michael Stonebraker. “Efficient Versioning for Scientific Array Databases.” 2012 IEEE 28th International Conference on Data Engineering (April 2012).
https://orcid.org/0000-0002-7470-3265
https://orcid.org/0000-0001-9184-9058
en_US
http://dx.doi.org/10.1109/ICDE.2012.102
Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
Creative Commons Attribution-Noncommercial-Share Alike
http://creativecommons.org/licenses/by-nc-sa/4.0/
application/pdf
Institute of Electrical and Electronics Engineers (IEEE)
MIT web domain
title Efficient Versioning for Scientific Array Databases