SubZero: A fine-grained lineage system for scientific databases

Data lineage is a key component of provenance that helps scientists track and query relationships between input and output data. While current systems readily support lineage relationships at the file or data array level, finer-grained support at an array-cell level is impractical due to the lack of...

Full description

Bibliographic Details
Main Authors: Wu, Eugene, Madden, Samuel R., Stonebraker, Michael
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Article
Language:en_US
Published: Institute of Electrical and Electronics Engineers (IEEE) 2014
Online Access:http://hdl.handle.net/1721.1/90854
https://orcid.org/0000-0001-9184-9058
https://orcid.org/0000-0002-7470-3265
Description
Summary:Data lineage is a key component of provenance that helps scientists track and query relationships between input and output data. While current systems readily support lineage relationships at the file or data array level, finer-grained support at an array-cell level is impractical due to the lack of support for user defined operators and the high runtime and storage overhead to store such lineage. We interviewed scientists in several domains to identify a set of common semantics that can be leveraged to efficiently store fine-grained lineage. We use the insights to define lineage representations that efficiently capture common locality properties in the lineage data, and a set of APIs so operator developers can easily export lineage information from user defined operators. Finally, we introduce two benchmarks derived from astronomy and genomics, and show that our techniques can reduce lineage query costs by up to 10× while incuring substantially less impact on workflow runtime and storage.