Deep learning: a statistical viewpoint

The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data with...

Bibliographic Details
Main Authors:	Bartlett, Peter L; Montanari, Andrea; Rakhlin, Alexander
Other Authors:	Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences
Format:	Article
Language:	English
Published:	Cambridge University Press (CUP) 2021
Online Access:	https://hdl.handle.net/1721.1/138312

collection	MIT
description	The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite giving a near-perfect fit to training data without any explicit effort to control model complexity, these methods exhibit excellent predictive accuracy. We conjecture that specific principles underlie these phenomena: that overparametrization allows gradient methods to find interpolating solutions, that these methods implicitly impose regularization, and that overparametrization leads to benign overfitting, that is, accurate predictions despite overfitting training data. In this article, we survey recent progress in statistical learning theory that provides examples illustrating these principles in simpler settings. We first review classical uniform convergence results and why they fall short of explaining aspects of the behaviour of deep learning methods. We give examples of implicit regularization in simple settings, where gradient methods lead to minimal norm functions that perfectly fit the training data. Then we review prediction methods that exhibit benign overfitting, focusing on regression problems with quadratic loss. For these methods, we can decompose the prediction rule into a simple component that is useful for prediction and a spiky component that is useful for overfitting but, in a favourable setting, does not harm prediction accuracy. We focus specifically on the linear regime for neural networks, where the network can be approximated by a linear model. In this regime, we demonstrate the success of gradient flow, and we consider benign overfitting with two-layer networks, giving an exact asymptotic analysis that precisely demonstrates the impact of overparametrization. We conclude by highlighting the key challenges that arise in extending these insights to realistic deep learning settings.
first_indexed	2024-09-23T11:24:00Z
format	Article
id	mit-1721.1/138312
institution	Massachusetts Institute of Technology
language	English
last_indexed	2024-09-23T11:24:00Z
publishDate	2021
publisher	Cambridge University Press (CUP)
record_format	dspace
spelling	mit-1721.1/138312 2023-12-21T22:05:27Z Deep learning: a statistical viewpoint Bartlett, Peter L Montanari, Andrea Rakhlin, Alexander Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences Statistics and Data Science Center (Massachusetts Institute of Technology) 2021-12-03T16:28:58Z 2021-12-03T16:28:58Z 2021-05 2021-12-03T16:24:47Z Article http://purl.org/eprint/type/JournalArticle https://hdl.handle.net/1721.1/138312 Bartlett, Peter L, Montanari, Andrea and Rakhlin, Alexander. 2021. "Deep learning: a statistical viewpoint." Acta Numerica, 30. en 10.1017/s0962492921000027 Acta Numerica Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf Cambridge University Press (CUP) arXiv
title	Deep learning: a statistical viewpoint
url	https://hdl.handle.net/1721.1/138312
