ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain

This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the fra...


Bibliographic Details
Main Authors:	Yu, Shangdi; Wang, Yiqiu; Gu, Yan; Dhulipala, Laxman; Shun, Julian
Other Authors:	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format:	Article
Language:	English
Published:	VLDB Endowment 2022
Online Access:	https://hdl.handle.net/1721.1/143883

_version_	1826214326202007552
author	Yu, Shangdi; Wang, Yiqiu; Gu, Yan; Dhulipala, Laxman; Shun, Julian
author2	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
author_facet	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory; Yu, Shangdi; Wang, Yiqiu; Gu, Yan; Dhulipala, Laxman; Shun, Julian
author_sort	Yu, Shangdi
collection	MIT
description	This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We propose the ParChain framework for designing parallel hierarchical agglomerative clustering (HAC) algorithms, and using the framework we obtain novel parallel algorithms for the complete linkage, average linkage, and Ward's linkage criteria. Compared to most previous parallel HAC algorithms, which require quadratic memory, our new algorithms require only linear memory, and are scalable to large data sets. ParChain is based on our parallelization of the nearest-neighbor chain algorithm, and enables multiple clusters to be merged on every round. We introduce two key optimizations that are critical for efficiency: a range query optimization that reduces the number of distance computations required when finding nearest neighbors of clusters, and a caching optimization that stores a subset of previously computed distances, which are likely to be reused. Experimentally, we show that our highly-optimized implementations using 48 cores with two-way hyper-threading achieve 5.8–110.1x speedup over state-of-the-art parallel HAC algorithms and achieve 13.75–54.23x self-relative speedup. Compared to state-of-the-art algorithms, our algorithms require up to 237.3x less space. Our algorithms are able to scale to data set sizes with tens of millions of points, which existing algorithms are not able to handle.
first_indexed	2024-09-23T16:03:35Z
format	Article
id	mit-1721.1/143883
institution	Massachusetts Institute of Technology
language	English
last_indexed	2024-09-23T16:03:35Z
publishDate	2022
publisher	VLDB Endowment
record_format	dspace
spelling	mit-1721.1/143883 2023-01-17T20:06:17Z 2022-07-20T15:02:25Z 2022-07-20T15:02:25Z 2021 2022-07-20T14:38:52Z Article http://purl.org/eprint/type/ConferencePaper https://hdl.handle.net/1721.1/143883 Yu, Shangdi, Wang, Yiqiu, Gu, Yan, Dhulipala, Laxman and Shun, Julian. 2021. "ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain." Proceedings of the VLDB Endowment, 15 (2). en 10.14778/3489496.3489509 Proceedings of the VLDB Endowment Creative Commons Attribution-NonCommercial-NoDerivs License http://creativecommons.org/licenses/by-nc-nd/4.0/ application/pdf VLDB Endowment
spellingShingle	Yu, Shangdi Wang, Yiqiu Gu, Yan Dhulipala, Laxman Shun, Julian ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain
title	ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain
title_full	ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain
title_fullStr	ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain
title_full_unstemmed	ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain
title_short	ParChain: a framework for parallel hierarchical agglomerative clustering using nearest-neighbor chain
title_sort	parchain a framework for parallel hierarchical agglomerative clustering using nearest neighbor chain
url	https://hdl.handle.net/1721.1/143883
work_keys_str_mv	AT yushangdi parchainaframeworkforparallelhierarchicalagglomerativeclusteringusingnearestneighborchain AT wangyiqiu parchainaframeworkforparallelhierarchicalagglomerativeclusteringusingnearestneighborchain AT guyan parchainaframeworkforparallelhierarchicalagglomerativeclusteringusingnearestneighborchain AT dhulipalalaxman parchainaframeworkforparallelhierarchicalagglomerativeclusteringusingnearestneighborchain AT shunjulian parchainaframeworkforparallelhierarchicalagglomerativeclusteringusingnearestneighborchain
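
Illustrative note: the abstract states that ParChain builds on the nearest-neighbor chain algorithm. As a rough aid to readers, the block below is a minimal sketch of the classic sequential nearest-neighbor chain procedure with complete linkage. It is not the authors' parallel implementation and includes neither the range query nor the caching optimization; the names nn_chain and complete_linkage are hypothetical, and cluster distances are recomputed from the raw points purely for brevity.

```python
from math import dist


def complete_linkage(a, b):
    """Complete-linkage distance: the largest pairwise distance between two clusters."""
    return max(dist(p, q) for p in a for q in b)


def nn_chain(points, linkage=complete_linkage):
    """Sequential nearest-neighbor chain HAC (illustrative sketch only).

    Returns a list of merges (id_a, id_b, distance). Ids below len(points)
    are input points; larger ids are clusters created by earlier merges.
    """
    clusters = {i: [p] for i, p in enumerate(points)}
    next_id = len(points)
    chain = []   # stack of cluster ids, each the nearest neighbor of the one before
    merges = []
    while len(clusters) > 1:
        if not chain:
            chain.append(next(iter(clusters)))  # start a new chain anywhere
        top = chain[-1]
        prev = chain[-2] if len(chain) > 1 else None
        # Find the nearest neighbor of the chain's top cluster.
        # Ties are broken in favor of prev so the chain cannot cycle.
        best, best_d = None, float("inf")
        for cid, members in clusters.items():
            if cid == top:
                continue
            d = linkage(clusters[top], members)
            if d < best_d or (d == best_d and cid == prev):
                best, best_d = cid, d
        if best == prev:
            # top and prev are reciprocal nearest neighbors: merge them.
            chain.pop()
            chain.pop()
            clusters[next_id] = clusters.pop(top) + clusters.pop(prev)
            merges.append((prev, top, best_d))
            next_id += 1
        else:
            chain.append(best)  # extend the chain and keep searching
    return merges


if __name__ == "__main__":
    for merge in nn_chain([(0.0,), (1.0,), (10.0,), (11.0,), (30.0,)]):
        print(merge)
```

The sketch merges one reciprocal nearest-neighbor pair at a time. According to the abstract, ParChain instead allows multiple clusters to be merged on every round; how the paper achieves that is not described in this record and is not reproduced here.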

