Low-degree learning and the metric entropy of polynomials

Low-degree learning and the metric entropy of polynomials, Discrete Analysis 2023:17, 23 pp. This paper is a follow-up to a paper by the first two authors on the general problem of designing algorithms that can efficiently learn functions defined on the Boolean cube $\{-1,1\}^n$, given suitable co...

Full description

Bibliographic Details
Main Authors: Alexandros Eskenazis, Paata Ivanisvili, Lauritz Streck
Format: Article
Language:English
Published: Diamond Open Access Journals 2023-11-01
Series:Discrete Analysis
Online Access:https://doi.org/10.19086/da.88507
_version_ 1797450795056103424
author Alexandros Eskenazis
Paata Ivanisvili
Lauritz Streck
author_facet Alexandros Eskenazis
Paata Ivanisvili
Lauritz Streck
author_sort Alexandros Eskenazis
collection DOAJ
description Low-degree learning and the metric entropy of polynomials, Discrete Analysis 2023:17, 23 pp. This paper is a follow-up to a paper by the first two authors on the general problem of designing algorithms that can efficiently learn functions defined on the Boolean cube $\{-1,1\}^n$, given suitable conditions on the functions. Given a function $f$, the rough idea is to take a series of sample points, calculate $f$ at those points, and guess a function $g$ that is close to $f$. Obviously, without strong information about $f$, nothing non-trivial can be done. If, for example, the values of $f$ are chosen uniformly and independently from $[-1,1]$, then one learns nothing from the sample values other than the sample values themselves. However, many functions that one actually wants to learn are far from arbitrary, so one can try to identify suitable properties that make them amenable to learning from a far smaller number of samples. One set-up considered by the authors is where $f:\{-1,1\}^n\to[-1,1]$ is a low-degree polynomial and the samples are taken randomly. Since $x^2=x$ when $x\in\{-1,1\}$, every polynomial function $f$ of degree $d$ defined on the Boolean cube has a formula of the form $f(x)=\sum_{|A|\leq d}\lambda_A\prod_{i\in A}x_i$, where $A$ ranges over subsets of $\{1,2,\dots,n\}$. The functions $x\mapsto\prod_{i\in A}x_i$ are the Walsh functions, and the $\lambda_A$ are therefore the Fourier coefficients of $f$ that correspond to sets $A$ of size at most $d$ -- the size of $A$ is often referred to as the degree of the Walsh function, and one often refers informally to "the Fourier coefficients of degree at most $d$" or "the degree-$d$ part of the Fourier expansion". From simple dimension considerations, one sees easily that one cannot hope to determine a degree-$d$ polynomial $f$ exactly from fewer than $\sum_{r=0}^d\binom nr$ samples (even if it is required to take values in $[-1,1]$), so the aim is to determine it _approximately_. In this paper, and in the previous paper of the first two authors, the approximation considered is an $L_2$ one. Thus, the aim is to find an algorithm that does the following: given a function $f:\{-1,1\}\to[-1,1]$ of degree at most $d$, it takes $m$ random samples $(x,f(x))$, and outputs a function $g$ such that $\|g-f\|_2<\epsilon$ with probability at least $1-\delta$. The important question is then how $m$ depends on $d, \epsilon$ and $\delta$. An obvious quantity of interest is the size $M(\epsilon)$ of the largest set of degree-$d$ functions $f:\{-1,1\}\to[-1,1]$ that are $\epsilon$-separated. This, the _metric entropy_ of the set of functions under consideration (more precisely, the metric entropy of the set is the dependence of this quantity on $\epsilon$), is in some sense "the size of the set, up to $\epsilon$-approximation". Heuristically, one can think of each sample as providing a constant number of bits of information, and therefore on information-theoretic grounds one would expect to need at least $c\log M(\epsilon)$ samples to obtain an $\epsilon$-approximation to $f$ (where $c$ will have some dependence on $\delta$). Since the space of polynomials of degree at most $d$ has dimension $\sum_{r=0}^d\binom nr$, which is of order of magnitude $n^d$, one might naively expect $\log M(\epsilon)$ to be of order of magnitude $n^d$ as well. And indeed, algorithms using roughly $n^d$ samples were the state of the art for the problem for quite a while. However, it is not obvious how the degree constraint interacts with the constraint that $f$ should take values in $[-1,1]$: in principle, this could greatly reduce the "effective dimension" of the set. And indeed, this turns out to be the case. The earlier article, quite surprisingly, obtained an algorithm with $m=C(\epsilon,\delta)\log n$. In broad terms, the authors noted that an inequality of Bohnenblust and Hille implies that the Fourier spectrum of $f$ must be “analytically sparse”, and then they used this sparsity to argue that instead of estimating all the non-zero Fourier coefficients to high accuracy, one can afford to ignore the small coefficients and in this way achieve much better sample complexity. One of the main purposes of the current article is to prove that the logarithmic bound in the earlier article is sharp, by obtaining a suitable lower bound for the metric entropy $M(\epsilon)$. They also show that an exponential dependence of the number of samples on the degree $d$, obtained in the earlier article, is necessary. A second purpose is to obtain "robust" versions of the results of the earlier article, meaning that the class of functions to be approximated is enlarged from exact polynomials of degree at most $d$ to _approximate_ ones -- that is, functions that are only approximately spanned by the Walsh functions of low degree. The do this by combining a junta theorem of Dinur-Friedgut-Kindler-O’Donnell (roughly, that approximately low-degree bounded functions approximately depend on just a few coordinates) with a general result they prove on the sample complexity of learning approximately low-degree bounded functions over the cube whose Fourier spectrum is concentrated around a small set of frequencies. The paper concludes with a number of applications of these results.
first_indexed 2024-03-09T14:44:47Z
format Article
id doaj.art-bb0396089e3a4a9cbbbee2fca36f2803
institution Directory Open Access Journal
issn 2397-3129
language English
last_indexed 2024-03-09T14:44:47Z
publishDate 2023-11-01
publisher Diamond Open Access Journals
record_format Article
series Discrete Analysis
spelling doaj.art-bb0396089e3a4a9cbbbee2fca36f28032023-11-27T09:18:17ZengDiamond Open Access JournalsDiscrete Analysis2397-31292023-11-01Low-degree learning and the metric entropy of polynomialsAlexandros EskenazisPaata IvanisviliLauritz StreckLow-degree learning and the metric entropy of polynomials, Discrete Analysis 2023:17, 23 pp. This paper is a follow-up to a paper by the first two authors on the general problem of designing algorithms that can efficiently learn functions defined on the Boolean cube $\{-1,1\}^n$, given suitable conditions on the functions. Given a function $f$, the rough idea is to take a series of sample points, calculate $f$ at those points, and guess a function $g$ that is close to $f$. Obviously, without strong information about $f$, nothing non-trivial can be done. If, for example, the values of $f$ are chosen uniformly and independently from $[-1,1]$, then one learns nothing from the sample values other than the sample values themselves. However, many functions that one actually wants to learn are far from arbitrary, so one can try to identify suitable properties that make them amenable to learning from a far smaller number of samples. One set-up considered by the authors is where $f:\{-1,1\}^n\to[-1,1]$ is a low-degree polynomial and the samples are taken randomly. Since $x^2=x$ when $x\in\{-1,1\}$, every polynomial function $f$ of degree $d$ defined on the Boolean cube has a formula of the form $f(x)=\sum_{|A|\leq d}\lambda_A\prod_{i\in A}x_i$, where $A$ ranges over subsets of $\{1,2,\dots,n\}$. The functions $x\mapsto\prod_{i\in A}x_i$ are the Walsh functions, and the $\lambda_A$ are therefore the Fourier coefficients of $f$ that correspond to sets $A$ of size at most $d$ -- the size of $A$ is often referred to as the degree of the Walsh function, and one often refers informally to "the Fourier coefficients of degree at most $d$" or "the degree-$d$ part of the Fourier expansion". From simple dimension considerations, one sees easily that one cannot hope to determine a degree-$d$ polynomial $f$ exactly from fewer than $\sum_{r=0}^d\binom nr$ samples (even if it is required to take values in $[-1,1]$), so the aim is to determine it _approximately_. In this paper, and in the previous paper of the first two authors, the approximation considered is an $L_2$ one. Thus, the aim is to find an algorithm that does the following: given a function $f:\{-1,1\}\to[-1,1]$ of degree at most $d$, it takes $m$ random samples $(x,f(x))$, and outputs a function $g$ such that $\|g-f\|_2<\epsilon$ with probability at least $1-\delta$. The important question is then how $m$ depends on $d, \epsilon$ and $\delta$. An obvious quantity of interest is the size $M(\epsilon)$ of the largest set of degree-$d$ functions $f:\{-1,1\}\to[-1,1]$ that are $\epsilon$-separated. This, the _metric entropy_ of the set of functions under consideration (more precisely, the metric entropy of the set is the dependence of this quantity on $\epsilon$), is in some sense "the size of the set, up to $\epsilon$-approximation". Heuristically, one can think of each sample as providing a constant number of bits of information, and therefore on information-theoretic grounds one would expect to need at least $c\log M(\epsilon)$ samples to obtain an $\epsilon$-approximation to $f$ (where $c$ will have some dependence on $\delta$). Since the space of polynomials of degree at most $d$ has dimension $\sum_{r=0}^d\binom nr$, which is of order of magnitude $n^d$, one might naively expect $\log M(\epsilon)$ to be of order of magnitude $n^d$ as well. And indeed, algorithms using roughly $n^d$ samples were the state of the art for the problem for quite a while. However, it is not obvious how the degree constraint interacts with the constraint that $f$ should take values in $[-1,1]$: in principle, this could greatly reduce the "effective dimension" of the set. And indeed, this turns out to be the case. The earlier article, quite surprisingly, obtained an algorithm with $m=C(\epsilon,\delta)\log n$. In broad terms, the authors noted that an inequality of Bohnenblust and Hille implies that the Fourier spectrum of $f$ must be “analytically sparse”, and then they used this sparsity to argue that instead of estimating all the non-zero Fourier coefficients to high accuracy, one can afford to ignore the small coefficients and in this way achieve much better sample complexity. One of the main purposes of the current article is to prove that the logarithmic bound in the earlier article is sharp, by obtaining a suitable lower bound for the metric entropy $M(\epsilon)$. They also show that an exponential dependence of the number of samples on the degree $d$, obtained in the earlier article, is necessary. A second purpose is to obtain "robust" versions of the results of the earlier article, meaning that the class of functions to be approximated is enlarged from exact polynomials of degree at most $d$ to _approximate_ ones -- that is, functions that are only approximately spanned by the Walsh functions of low degree. The do this by combining a junta theorem of Dinur-Friedgut-Kindler-O’Donnell (roughly, that approximately low-degree bounded functions approximately depend on just a few coordinates) with a general result they prove on the sample complexity of learning approximately low-degree bounded functions over the cube whose Fourier spectrum is concentrated around a small set of frequencies. The paper concludes with a number of applications of these results.https://doi.org/10.19086/da.88507
spellingShingle Alexandros Eskenazis
Paata Ivanisvili
Lauritz Streck
Low-degree learning and the metric entropy of polynomials
Discrete Analysis
title Low-degree learning and the metric entropy of polynomials
title_full Low-degree learning and the metric entropy of polynomials
title_fullStr Low-degree learning and the metric entropy of polynomials
title_full_unstemmed Low-degree learning and the metric entropy of polynomials
title_short Low-degree learning and the metric entropy of polynomials
title_sort low degree learning and the metric entropy of polynomials
url https://doi.org/10.19086/da.88507
work_keys_str_mv AT alexandroseskenazis lowdegreelearningandthemetricentropyofpolynomials
AT paataivanisvili lowdegreelearningandthemetricentropyofpolynomials
AT lauritzstreck lowdegreelearningandthemetricentropyofpolynomials