Cluster Analysis in High Dimensions: Robustness, Privacy, and Beyond

Cluster analysis focuses on understanding the cluster structure of data, and is perhaps one of the most important subfields in high-dimensional data analysis. Traditionally, cluster analysis focuses on partitioning data into closely related groups, such as in k-means clustering and learning mixture...

Full description

Bibliographic Details
Main Author:	Narayanan, Shyam
Other Authors:	Indyk, Piotr
Format:	Thesis
Published:	Massachusetts Institute of Technology 2024
Online Access:	https://hdl.handle.net/1721.1/156617

_version_	1826192899905159168
author	Narayanan, Shyam
author2	Indyk, Piotr
author_facet	Indyk, Piotr Narayanan, Shyam
author_sort	Narayanan, Shyam
collection	MIT
description	Cluster analysis focuses on understanding the cluster structure of data, and is perhaps one of the most important subfields in high-dimensional data analysis. Traditionally, cluster analysis focuses on partitioning data into closely related groups, such as in k-means clustering and learning mixture models. However, one sometimes overlooked part of cluster analysis is analyzing data from a single cluster: this encompasses problems such as mean estimation and covariance estimation, which correspond to learning the location and shape of a cluster, respectively. In this thesis, we study various classic problems in high-dimensional cluster analysis, relating to both identifying several clusters and learning a single cluster. We provide improved algorithms and lower bounds for problems including k-means and k-median clustering, Gaussian mean and covariance estimation, high-dimensional mean testing, and learning mixtures of Gaussians. Importantly, in this thesis we also focus on the socially motivated constraints of robustness, privacy, and explainability, and how they affect the complexity of these problems. In our quest to understand cluster analysis under such socially motivated constraints, we discover the first black-box transformation from robustness to privacy, as well as the first-known statistical separation between some natural models of robust statistics.
first_indexed	2024-09-23T09:30:52Z
format	Thesis
id	mit-1721.1/156617
institution	Massachusetts Institute of Technology
last_indexed	2024-09-23T09:30:52Z
publishDate	2024
publisher	Massachusetts Institute of Technology
record_format	dspace
spelling	mit-1721.1/1566172024-09-04T03:54:30Z Cluster Analysis in High Dimensions: Robustness, Privacy, and Beyond Narayanan, Shyam Indyk, Piotr Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Cluster analysis focuses on understanding the cluster structure of data, and is perhaps one of the most important subfields in high-dimensional data analysis. Traditionally, cluster analysis focuses on partitioning data into closely related groups, such as in k-means clustering and learning mixture models. However, one sometimes overlooked part of cluster analysis is analyzing data from a single cluster: this encompasses problems such as mean estimation and covariance estimation, which correspond to learning the location and shape of a cluster, respectively. In this thesis, we study various classic problems in high-dimensional cluster analysis, relating to both identifying several clusters and learning a single cluster. We provide improved algorithms and lower bounds for problems including k-means and k-median clustering, Gaussian mean and covariance estimation, high-dimensional mean testing, and learning mixtures of Gaussians. Importantly, in this thesis we also focus on the socially motivated constraints of robustness, privacy, and explainability, and how they affect the complexity of these problems. In our quest to understand cluster analysis under such socially motivated constraints, we discover the first black-box transformation from robustness to privacy, as well as the first-known statistical separation between some natural models of robust statistics. Sc.D. 2024-09-03T21:12:09Z 2024-09-03T21:12:09Z 2024-05 2024-07-10T13:01:49.493Z Thesis https://hdl.handle.net/1721.1/156617 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle	Narayanan, Shyam Cluster Analysis in High Dimensions: Robustness, Privacy, and Beyond
title	Cluster Analysis in High Dimensions: Robustness, Privacy, and Beyond
title_full	Cluster Analysis in High Dimensions: Robustness, Privacy, and Beyond
title_fullStr	Cluster Analysis in High Dimensions: Robustness, Privacy, and Beyond
title_full_unstemmed	Cluster Analysis in High Dimensions: Robustness, Privacy, and Beyond
title_short	Cluster Analysis in High Dimensions: Robustness, Privacy, and Beyond
title_sort	cluster analysis in high dimensions robustness privacy and beyond
url	https://hdl.handle.net/1721.1/156617
work_keys_str_mv	AT narayananshyam clusteranalysisinhighdimensionsrobustnessprivacyandbeyond

Cluster Analysis in High Dimensions: Robustness, Privacy, and Beyond

Similar Items