Accurate and Fast Approximate Graph Mining at Scale
Approximate graph pattern mining (A-GPM) is an important data analysis tool for numerous graph-based applications. There exist sampling-based A-GPM systems to provide automation and generalization over a wide variety of use cases. Despite improved usability, there are two major obstacles that preven...
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis |
Published: |
Massachusetts Institute of Technology
2024
|
Online Access: | https://hdl.handle.net/1721.1/156954 |
_version_ | 1824457983741919232 |
---|---|
author | Arpaci-Dusseau, Anna |
author2 | Chen, Xuhao |
author_facet | Chen, Xuhao Arpaci-Dusseau, Anna |
author_sort | Arpaci-Dusseau, Anna |
collection | MIT |
description | Approximate graph pattern mining (A-GPM) is an important data analysis tool for numerous graph-based applications. There exist sampling-based A-GPM systems to provide automation and generalization over a wide variety of use cases. Despite improved usability, there are two major obstacles that prevent existing A-GPM systems being adopted in practice. First, the termination mechanism that decides when to terminate sampling lacks theoretical backup on confidence, and performs significantly unstable and thus slow in practice. Second, they particularly suffer poor performance when dealing with the “needle-in-the-hay” cases, because a huge number of samples are required to converge, given the extremely low hit rate of their lazy-pruning strategy and fixed sampling schemes. We build ScaleGPM, an accurate and fast A-GPM system that removes the two obstacles. First, we propose a novel on-the-fly convergence detection mechanism to achieve stable termination and provide theoretical guarantee on the confidence, with negligible online overhead. Second, we propose two techniques to deal with the “needle-in-the-hay” problem, eager-verify and hybrid sampling. Our eager-verify method drastically improves sampling hit rate by pruning unpromising candidates as early as possible. Hybrid sampling further improves performance by automatically choosing the better scheme between fine-grained and coarse-grained sampling schemes. Experiments show that our online convergence detection mechanism can precisely detect convergence, and results in stable and rapid termination with theoretically guaranteed confidence. We also show the effectiveness of eager-verify in improving the hit rate, and the scheme-selection mechanism in correctly choosing the better scheme for various cases. Overall, ScaleGPM achieves an geomean average of 565× (up to 610,169×) speedup over the state-of-the-art A-GPM system, Arya. ScaleGPM is also four orders of magnitude faster than state-of-the-art exact GPM system, GraphZero. In particular, ScaleGPM handles billion-scale graphs in seconds, where existing systems either run out of memory or fail to complete in hours. |
first_indexed | 2025-02-19T04:18:40Z |
format | Thesis |
id | mit-1721.1/156954 |
institution | Massachusetts Institute of Technology |
last_indexed | 2025-02-19T04:18:40Z |
publishDate | 2024 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/1569542024-09-25T03:53:07Z Accurate and Fast Approximate Graph Mining at Scale Arpaci-Dusseau, Anna Chen, Xuhao Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Approximate graph pattern mining (A-GPM) is an important data analysis tool for numerous graph-based applications. There exist sampling-based A-GPM systems to provide automation and generalization over a wide variety of use cases. Despite improved usability, there are two major obstacles that prevent existing A-GPM systems being adopted in practice. First, the termination mechanism that decides when to terminate sampling lacks theoretical backup on confidence, and performs significantly unstable and thus slow in practice. Second, they particularly suffer poor performance when dealing with the “needle-in-the-hay” cases, because a huge number of samples are required to converge, given the extremely low hit rate of their lazy-pruning strategy and fixed sampling schemes. We build ScaleGPM, an accurate and fast A-GPM system that removes the two obstacles. First, we propose a novel on-the-fly convergence detection mechanism to achieve stable termination and provide theoretical guarantee on the confidence, with negligible online overhead. Second, we propose two techniques to deal with the “needle-in-the-hay” problem, eager-verify and hybrid sampling. Our eager-verify method drastically improves sampling hit rate by pruning unpromising candidates as early as possible. Hybrid sampling further improves performance by automatically choosing the better scheme between fine-grained and coarse-grained sampling schemes. Experiments show that our online convergence detection mechanism can precisely detect convergence, and results in stable and rapid termination with theoretically guaranteed confidence. We also show the effectiveness of eager-verify in improving the hit rate, and the scheme-selection mechanism in correctly choosing the better scheme for various cases. Overall, ScaleGPM achieves an geomean average of 565× (up to 610,169×) speedup over the state-of-the-art A-GPM system, Arya. ScaleGPM is also four orders of magnitude faster than state-of-the-art exact GPM system, GraphZero. In particular, ScaleGPM handles billion-scale graphs in seconds, where existing systems either run out of memory or fail to complete in hours. M.Eng. 2024-09-24T18:22:44Z 2024-09-24T18:22:44Z 2024-05 2024-07-11T14:37:38.876Z Thesis https://hdl.handle.net/1721.1/156954 Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by-nc-nd/4.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Arpaci-Dusseau, Anna Accurate and Fast Approximate Graph Mining at Scale |
title | Accurate and Fast Approximate Graph Mining at Scale |
title_full | Accurate and Fast Approximate Graph Mining at Scale |
title_fullStr | Accurate and Fast Approximate Graph Mining at Scale |
title_full_unstemmed | Accurate and Fast Approximate Graph Mining at Scale |
title_short | Accurate and Fast Approximate Graph Mining at Scale |
title_sort | accurate and fast approximate graph mining at scale |
url | https://hdl.handle.net/1721.1/156954 |
work_keys_str_mv | AT arpacidusseauanna accurateandfastapproximategraphminingatscale |