Optimization for Deep Learning: Bridging the Theory-Practice Gap

Bibliographic Details
Main Author: Yun, Chulhee
Other Authors: Sra, Suvrit
Format: Thesis
Published: Massachusetts Institute of Technology, 2022
Online Access: https://hdl.handle.net/1721.1/139962
https://orcid.org/0000-0002-3504-4690
Description
Summary: Deep learning has achieved impressive empirical breakthroughs, but many theoretical questions remain unsolved. For example, despite the nonconvexity of training objectives, deep neural networks can be reliably trained to fully memorize training datasets, yet they perform very well on unseen test data. Although progress has been made in recent years, the gap between theory and practice remains wide; this thesis takes a step towards closing it.

First, we discuss the optimization landscape of neural networks. We prove the existence of spurious local minima for general datasets and activation functions, which suggests that the convergence of optimization methods on neural networks cannot be explained via the training objectives alone.

Next, we establish tight bounds on the memorization capacity of ReLU networks. We show that ReLU networks of width $\Theta(\sqrt{n})$ can memorize arbitrary sets of $n$ data points, bringing the existing width requirement of $n$ down to a much more realistic number.

The third part discusses implicit bias in neural network training, an area that seeks to understand generalization via the optimization trajectory. Through a unified analysis using a tensor representation of neural networks, we show how different linear neural network architectures lead to different global minima in overparameterized regimes.

Lastly, we address a major theory-practice gap in stochastic finite-sum optimization: practical algorithms shuffle and then iterate through the component indices, while most theoretical analyses assume the indices are sampled uniformly with replacement. To close this gap, we develop tight convergence rates for shuffling-based SGD that are faster than the rates of its uniform-sampling counterpart.
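The shuffling-vs-uniform-sampling distinction in the last part can be illustrated with a minimal sketch. The function `sgd` below, its `scheme` parameter, and the toy quadratic finite sum are all illustrative assumptions, not code or notation from the thesis; it only contrasts the two ways of choosing component indices on the same objective.

```python
import random

def sgd(component_grads, x0, lr, epochs, scheme="shuffle", seed=0):
    """Minimize f(x) = (1/n) * sum_i f_i(x) by stepping along one
    component gradient at a time.

    scheme="shuffle": random-reshuffling SGD -- each epoch visits every
    component exactly once, in a freshly shuffled order (what practical
    implementations typically do).
    scheme="uniform": with-replacement SGD -- each step samples a
    component index uniformly at random (what most analyses assume).
    """
    rng = random.Random(seed)
    n = len(component_grads)
    x = x0
    for _ in range(epochs):
        if scheme == "shuffle":
            order = list(range(n))
            rng.shuffle(order)       # one pass over all indices, permuted
        else:
            order = [rng.randrange(n) for _ in range(n)]  # i.i.d. indices
        for i in order:
            x -= lr * component_grads[i](x)
    return x

# Toy finite sum: f_i(x) = 0.5 * (x - a_i)^2, so the minimizer of the
# average is mean(a_i) = 2.5.
targets = [1.0, 2.0, 3.0, 4.0]
grads = [lambda x, a=a: x - a for a in targets]

x_shuffle = sgd(grads, x0=0.0, lr=0.1, epochs=200, scheme="shuffle")
x_uniform = sgd(grads, x0=0.0, lr=0.1, epochs=200, scheme="uniform")
```

With a small constant step size, both variants end up in a neighborhood of the minimizer 2.5; the thesis's contribution concerns the rates at which such neighborhoods shrink, which the shuffling scheme attains faster than the with-replacement scheme analyzed in most prior work.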