Scaling Laws for Deep Learning

Running faster will only get you so far — it is generally advisable to first understand where the roads lead, then get a car ... The renaissance of machine learning (ML) and deep learning (DL) over the last decade is accompanied by an unscalable computational cost, limiting its advancement and weighing on the field in practice. In this thesis we take a systematic approach to address the algorithmic and methodological limitations at the root of these costs. We first demonstrate that DL training and pruning are predictable and governed by scaling laws — for state of the art models and tasks, spanning image classification and language modeling, as well as for state of the art model compression via iterative pruning. Predictability, via the establishment of these scaling laws, provides the path for principled design and trade-off reasoning, currently largely lacking in the field. We then continue to analyze the sources of the scaling laws, offering an approximation-theoretic view and showing through the exploration of a noiseless realizable case that DL is in fact dominated by error sources very far from the lower error limit. We conclude by building on the gained theoretical understanding of the scaling laws’ origins. We present a conjectural path to eliminate one of the current dominant error sources — through a data bandwidth limiting hypothesis and the introduction of Nyquist learners — which can, in principle, reach the generalization error lower limit (e.g. 0 in the noiseless case), at finite dataset size.
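
The scaling laws referenced in the abstract are typically expressed, in the broader literature, as joint power laws relating generalization error to dataset size and model size. The display below is an illustrative sketch under that assumption rather than a formula quoted from the thesis; the constants a, b, alpha, beta and the floor c_infinity are hypothetical fit parameters.

    % Illustrative scaling-law form (an assumption drawn from the broader
    % scaling-law literature, not quoted from the thesis text):
    % generalization error epsilon as a joint power law in dataset size n
    % and model size m.
    \[
      \epsilon(n, m) \approx a\, n^{-\alpha} + b\, m^{-\beta} + c_{\infty}
    \]
    % a, b, alpha, beta: fitted constants; c_{\infty}: asymptotic error floor
    % (0 in a noiseless realizable setting).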

Bibliographic Details
Main Author: Rosenfeld, Jonathan S.
Other Authors: Shavit, Nir
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Date Issued: 2021-09
Degree: Ph.D.
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Rights: In Copyright - Educational Use Permitted; Copyright MIT (http://rightsstatements.org/page/InC-EDU/1.0/)
Online Access: https://hdl.handle.net/1721.1/139897