Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment

© 2019 IEEE. Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Exactly computing such relationships is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n...

Full description

Bibliographic Details
Main Authors: Castro Fernandez, Raul, Min, Jisoo, Nava, Demitri, Madden, Samuel
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Article
Language:English
Published: Institute of Electrical and Electronics Engineers (IEEE) 2022
Online Access:https://hdl.handle.net/1721.1/143769
_version_ 1811072699729969152
author Castro Fernandez, Raul
Min, Jisoo
Nava, Demitri
Madden, Samuel
author2 Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
author_facet Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Castro Fernandez, Raul
Min, Jisoo
Nava, Demitri
Madden, Samuel
author_sort Castro Fernandez, Raul
collection MIT
description © 2019 IEEE. Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Exactly computing such relationships is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n2) operation. Fortunately, it is possible to obtain approximate solutions much faster, using locality sensitive hashing (LSH). Unfortunately, LSH does not lend itself naturally to compute containment, and only returns results with a similarity beyond a pre-defined threshold; we want to know the specific similarity and containment score. The main contribution of this paper is LAZO, a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set. In addition, we show how to use the method to improve the quality of the original JS and JC estimates. Last, we implement LAZO as a new indexing structure that has these additional properties: i) it returns numerical scores to indicate the degree of similarity and containment between each candidate and the query - instead of only returning the candidate set; ii) it permits to query for a specific threshold on-the-fly, as opposed to LSH indexes that need to be configured with a pre-defined threshold a priori; iii) it works in a data-oblivious way, so it can be incrementally maintained. We evaluate LAZO on real-world datasets and show its ability to estimate containment and similarity better and faster than existing methods.
first_indexed 2024-09-23T09:10:27Z
format Article
id mit-1721.1/143769
institution Massachusetts Institute of Technology
language English
last_indexed 2024-09-23T09:10:27Z
publishDate 2022
publisher Institute of Electrical and Electronics Engineers (IEEE)
record_format dspace
spelling mit-1721.1/1437692023-03-28T20:10:51Z Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment Castro Fernandez, Raul Min, Jisoo Nava, Demitri Madden, Samuel Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory © 2019 IEEE. Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Exactly computing such relationships is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n2) operation. Fortunately, it is possible to obtain approximate solutions much faster, using locality sensitive hashing (LSH). Unfortunately, LSH does not lend itself naturally to compute containment, and only returns results with a similarity beyond a pre-defined threshold; we want to know the specific similarity and containment score. The main contribution of this paper is LAZO, a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set. In addition, we show how to use the method to improve the quality of the original JS and JC estimates. Last, we implement LAZO as a new indexing structure that has these additional properties: i) it returns numerical scores to indicate the degree of similarity and containment between each candidate and the query - instead of only returning the candidate set; ii) it permits to query for a specific threshold on-the-fly, as opposed to LSH indexes that need to be configured with a pre-defined threshold a priori; iii) it works in a data-oblivious way, so it can be incrementally maintained. We evaluate LAZO on real-world datasets and show its ability to estimate containment and similarity better and faster than existing methods. 2022-07-15T15:54:09Z 2022-07-15T15:54:09Z 2019 2022-07-15T15:27:26Z Article http://purl.org/eprint/type/ConferencePaper https://hdl.handle.net/1721.1/143769 Castro Fernandez, Raul, Min, Jisoo, Nava, Demitri and Madden, Samuel. 2019. "Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment." Proceedings - International Conference on Data Engineering, 2019-April. en 10.1109/ICDE.2019.00109 Proceedings - International Conference on Data Engineering Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf Institute of Electrical and Electronics Engineers (IEEE) MIT web domain
spellingShingle Castro Fernandez, Raul
Min, Jisoo
Nava, Demitri
Madden, Samuel
Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment
title Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment
title_full Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment
title_fullStr Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment
title_full_unstemmed Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment
title_short Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment
title_sort lazo a cardinality based method for coupled estimation of jaccard similarity and containment
url https://hdl.handle.net/1721.1/143769
work_keys_str_mv AT castrofernandezraul lazoacardinalitybasedmethodforcoupledestimationofjaccardsimilarityandcontainment
AT minjisoo lazoacardinalitybasedmethodforcoupledestimationofjaccardsimilarityandcontainment
AT navademitri lazoacardinalitybasedmethodforcoupledestimationofjaccardsimilarityandcontainment
AT maddensamuel lazoacardinalitybasedmethodforcoupledestimationofjaccardsimilarityandcontainment