Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment
© 2019 IEEE. Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Exactly computing such relationships is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n...
Main Authors: | , , , |
---|---|
Other Authors: | |
Format: | Article |
Language: | English |
Published: |
Institute of Electrical and Electronics Engineers (IEEE)
2022
|
Online Access: | https://hdl.handle.net/1721.1/143769 |
_version_ | 1811072699729969152 |
---|---|
author | Castro Fernandez, Raul Min, Jisoo Nava, Demitri Madden, Samuel |
author2 | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory |
author_facet | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory Castro Fernandez, Raul Min, Jisoo Nava, Demitri Madden, Samuel |
author_sort | Castro Fernandez, Raul |
collection | MIT |
description | © 2019 IEEE. Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Exactly computing such relationships is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n2) operation. Fortunately, it is possible to obtain approximate solutions much faster, using locality sensitive hashing (LSH). Unfortunately, LSH does not lend itself naturally to compute containment, and only returns results with a similarity beyond a pre-defined threshold; we want to know the specific similarity and containment score. The main contribution of this paper is LAZO, a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set. In addition, we show how to use the method to improve the quality of the original JS and JC estimates. Last, we implement LAZO as a new indexing structure that has these additional properties: i) it returns numerical scores to indicate the degree of similarity and containment between each candidate and the query - instead of only returning the candidate set; ii) it permits to query for a specific threshold on-the-fly, as opposed to LSH indexes that need to be configured with a pre-defined threshold a priori; iii) it works in a data-oblivious way, so it can be incrementally maintained. We evaluate LAZO on real-world datasets and show its ability to estimate containment and similarity better and faster than existing methods. |
first_indexed | 2024-09-23T09:10:27Z |
format | Article |
id | mit-1721.1/143769 |
institution | Massachusetts Institute of Technology |
language | English |
last_indexed | 2024-09-23T09:10:27Z |
publishDate | 2022 |
publisher | Institute of Electrical and Electronics Engineers (IEEE) |
record_format | dspace |
spelling | mit-1721.1/1437692023-03-28T20:10:51Z Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment Castro Fernandez, Raul Min, Jisoo Nava, Demitri Madden, Samuel Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory © 2019 IEEE. Data analysts often need to find datasets that are similar (i.e., have high overlap) or that are subsets of one another (i.e., one contains the other). Exactly computing such relationships is expensive because it entails an all-pairs comparison between all values in all datasets, an O(n2) operation. Fortunately, it is possible to obtain approximate solutions much faster, using locality sensitive hashing (LSH). Unfortunately, LSH does not lend itself naturally to compute containment, and only returns results with a similarity beyond a pre-defined threshold; we want to know the specific similarity and containment score. The main contribution of this paper is LAZO, a method to simultaneously estimate both the similarity and containment of datasets, based on a redefinition of Jaccard similarity which takes into account the cardinality of each set. In addition, we show how to use the method to improve the quality of the original JS and JC estimates. Last, we implement LAZO as a new indexing structure that has these additional properties: i) it returns numerical scores to indicate the degree of similarity and containment between each candidate and the query - instead of only returning the candidate set; ii) it permits to query for a specific threshold on-the-fly, as opposed to LSH indexes that need to be configured with a pre-defined threshold a priori; iii) it works in a data-oblivious way, so it can be incrementally maintained. We evaluate LAZO on real-world datasets and show its ability to estimate containment and similarity better and faster than existing methods. 2022-07-15T15:54:09Z 2022-07-15T15:54:09Z 2019 2022-07-15T15:27:26Z Article http://purl.org/eprint/type/ConferencePaper https://hdl.handle.net/1721.1/143769 Castro Fernandez, Raul, Min, Jisoo, Nava, Demitri and Madden, Samuel. 2019. "Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment." Proceedings - International Conference on Data Engineering, 2019-April. en 10.1109/ICDE.2019.00109 Proceedings - International Conference on Data Engineering Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf Institute of Electrical and Electronics Engineers (IEEE) MIT web domain |
spellingShingle | Castro Fernandez, Raul Min, Jisoo Nava, Demitri Madden, Samuel Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment |
title | Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment |
title_full | Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment |
title_fullStr | Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment |
title_full_unstemmed | Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment |
title_short | Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment |
title_sort | lazo a cardinality based method for coupled estimation of jaccard similarity and containment |
url | https://hdl.handle.net/1721.1/143769 |
work_keys_str_mv | AT castrofernandezraul lazoacardinalitybasedmethodforcoupledestimationofjaccardsimilarityandcontainment AT minjisoo lazoacardinalitybasedmethodforcoupledestimationofjaccardsimilarityandcontainment AT navademitri lazoacardinalitybasedmethodforcoupledestimationofjaccardsimilarityandcontainment AT maddensamuel lazoacardinalitybasedmethodforcoupledestimationofjaccardsimilarityandcontainment |