Multiscale Geometric Methods for Data Sets I: Multiscale SVD, Noise and Curvature

Large data sets are often modeled as noisy samples from probability distributions in R^D, with D large. It has been observed that oftentimes the support M of these probability distributions is well approximated by low-dimensional sets, perhaps even by manifolds. We consider sets that are locally well approximated by k-dimensional planes, with k << D; k-dimensional manifolds isometrically embedded in R^D are a special case. Samples from this distribution are furthermore corrupted by D-dimensional noise. Certain tools from multiscale geometric measure theory and harmonic analysis are well suited to the study of samples from such probability distributions and yield quantitative geometric information about them. In this paper we introduce and study multiscale covariance matrices, i.e. covariances corresponding to the distribution restricted to a ball of radius r, with a fixed center and varying r, and under rather general geometric assumptions we study how their empirical, noisy counterparts behave. We prove that in the range of scales where these covariance matrices are most informative, the empirical, noisy covariances are close to their expected, noiseless counterparts. In fact, this holds as soon as the number of samples in the balls where the covariance matrices are computed is linear in the intrinsic dimension of M. As an application, we present an algorithm for estimating the intrinsic dimension of M.
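The central object of the abstract — the covariance of the point cloud restricted to a ball of radius r, examined across several r — can be sketched in a few lines. The following is an illustrative sketch only, not the authors' algorithm: the function name `multiscale_singular_values` and the toy circle data set are invented for this example, and the dimension-counting heuristic is a simplified reading of the multiscale-SVD idea.

```python
import numpy as np

def multiscale_singular_values(X, center, radii):
    """For each radius r, return the singular values (descending) of the
    covariance of the points of X lying in the ball B(center, r).
    Illustrative sketch of the multiscale-covariance idea."""
    out = []
    for r in radii:
        ball = X[np.linalg.norm(X - center, axis=1) <= r]
        if len(ball) < 2:
            out.append(None)
            continue
        C = np.cov(ball, rowvar=False)  # empirical covariance at scale r
        # singular values = sqrt of covariance eigenvalues; sort descending
        out.append(np.sqrt(np.maximum(np.linalg.eigvalsh(C), 0.0))[::-1])
    return out

# Toy example: noisy samples from a 1-dimensional curve (a circle) in R^3,
# i.e. intrinsic dimension k = 1 and ambient dimension D = 3.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, 2000)
X = np.c_[np.cos(t), np.sin(t), np.zeros_like(t)]
X += 0.01 * rng.standard_normal(X.shape)  # D-dimensional noise

sv = multiscale_singular_values(X, X[0], radii=[0.1, 0.3, 0.6])
```

At very small scales the noise dominates all directions, and at large scales curvature inflates spurious directions; in an intermediate range of r, one singular value (the tangent direction) grows roughly linearly in r while the others stay near the noise level, so counting the dominant singular values over that range recovers k = 1 here.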

Bibliographic Details
Main Authors: Little, Anna V., Maggioni, Mauro, Rosasco, Lorenzo
Other Authors: Tomaso Poggio
Published: 2012
Subjects: machine learning; high dimensional data
Online Access:http://hdl.handle.net/1721.1/72597
Institution: Massachusetts Institute of Technology
Affiliation: Center for Biological and Computational Learning (CBCL)
Report Numbers: MIT-CSAIL-TR-2012-029; CBCL-310
Date Issued: 2012-09-08
Physical Description: 59 p.
Format: application/pdf