Summary: | Clustering is a fundamental unsupervised machine learning task of detecting groups of similar objects in data. Clustering can be used to identify the underlying substructures of data and can detect essential functional groups, such as people with similar interests, news articles on similar topics, or proteins with similar utilities, which can then be used for various downstream tasks. In this thesis, we focus on the efficient clustering algorithms for metric and graph data. Both types of data are common in various applications today. Moreover, in modern applications, the size of both types of datasets and the dimensionality of the metric data are scaling rapidly. In this thesis, we solve the challenge of clustering large datasets by designing algorithms with high parallelism that take advantage of modern shared-memory multi-core machines and dynamic algorithms that can efficiently update the result without re-computing from scratch. We also present approximate clustering algorithms that scale to high-dimensional data.
The first part of this thesis studies parallel clustering algorithms for low-dimensional metric data. Clustering algorithms are frequently expected to perform numerous similarity searches, as clustering entails grouping similar objects together. Although many algorithms have been designed for nearest neighbor search, many clustering algorithms require customized nearest neighbor search with special constraints, so we cannot use existing nearest neighbor search approaches of the shelf. We present examples of how to design customized similarity searches for hierarchical agglomerative clustering and density peaks clustering algorithms using optimized tree index data structures.
The second part of this thesis studies parallel clustering algorithms for high-dimensional metric data. In this thesis, we show two approaches for clustering high-dimensional data. The first is to design approximate similarity searches that are customized for a particular clustering algorithm. The second is to convert the metric data into a graph representation, and then run graph clustering algorithms on this derived graph. For the first approach, we present an approximate density peaks clustering framework for high-dimensional data using approximate similarity searches. We also show that the framework has good empirical performance with graph-based nearest neighbor search techniques on high-dimensional data. For the second approach, we present an algorithm that clusters high-dimensional metric data by converting data into a particular graph representation called the triangulated maximally fltered graph and then running the directed bubble hierarchical tree algorithm on the converted graph.
The final part of this thesis studies clustering algorithms for graph data. We present a dynamic graph clustering algorithm that can quickly update the output when the input changes instead of performing a slow recomputation from scratch. We also introduce a benchmarking suite for comprehensively evaluating the quality and speed of parallel graph clustering algorithms on both native graphs and k-nearest neighbor graphs converted from metric data. Our evaluation includes methods tailored to both weighted and unweighted graphs.
|