Optimization for Deep Learning: Bridging the Theory-Practice Gap
Main Author:
Other Authors:
Format: Thesis
Published: Massachusetts Institute of Technology, 2022
Online Access: https://hdl.handle.net/1721.1/139962 | https://orcid.org/0000-0002-3504-4690
Summary: Deep learning has achieved impressive empirical breakthroughs, but many theoretical questions remain open. For example, despite the nonconvexity of training objectives, deep neural networks can be reliably trained to fully memorize training datasets, yet still perform very well on unseen test data. Although progress has been made in recent years, the gap between theory and practice remains wide; this thesis takes a step toward closing it.
First, we discuss the optimization landscape of neural networks. We prove the existence of spurious local minima for general datasets and activation functions, which suggests that the convergence of optimization methods on neural networks cannot be explained by the training objective alone.
Next, we establish tight bounds on the memorization capacity of ReLU networks. We show that width-$\Theta(\sqrt{n})$ ReLU networks can memorize arbitrary sets of $n$ data points, bringing the previously known width requirement of $n$ down to a far more realistic number.

The third part studies implicit bias in neural network training, an area that seeks to explain generalization through the optimization trajectory. Through a unified analysis based on a tensor representation of neural networks, we show how different linear network architectures lead to different global minima in overparameterized regimes.

Lastly, we address a major theory-practice gap in stochastic finite-sum optimization: practical algorithms shuffle and iterate through the component indices, while most theoretical analyses assume the indices are sampled uniformly at random. To close this gap, we develop tight convergence rates for shuffling-based SGD that are faster than the rate of its uniform-sampling counterpart.
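The thesis's width-$\Theta(\sqrt{n})$ memorization result relies on a deeper construction not reproduced here. As a simpler warm-up, the classical fact it improves on — that a one-hidden-layer ReLU network of width $O(n)$ can memorize $n$ data points — can be verified directly for one-dimensional inputs; all names and data below are illustrative assumptions, not the thesis's construction:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def memorize_1d(x, y):
    """Build a one-hidden-layer ReLU net that interpolates 1-D data exactly.

    Hidden unit i computes relu(t - x_i) at the sorted kink location x_i;
    the output weights are chosen so the piecewise-linear slopes match the
    data, so every (x_i, y_i) is reproduced. This uses n-1 hidden units for
    n points -- width O(n), not the Theta(sqrt(n)) of the thesis.
    """
    order = np.argsort(x)
    x, y = x[order], y[order]
    slopes = np.diff(y) / np.diff(x)        # slope on each interval [x_i, x_{i+1}]
    coeffs = np.diff(slopes, prepend=0.0)   # output weight a_i = s_i - s_{i-1}
    biases = x[:-1]                         # kink locations of the hidden units
    def net(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        return y[0] + relu(t[:, None] - biases[None, :]) @ coeffs
    return net

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(-1.0, 1.0, size=n)   # distinct random inputs
y = rng.normal(size=n)               # arbitrary labels
net = memorize_1d(x, y)
max_err = np.max(np.abs(net(x) - y)) # zero training error up to float precision
print(max_err)
```

The construction is a telescoping sum: at each training point only the kinks to its left are active, and their weighted slopes sum exactly to the target value.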
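The last paragraph contrasts two index orders for SGD. A minimal numerical sketch of the two schemes on a synthetic finite-sum least-squares problem — problem size, step size, and epoch count are illustrative assumptions, and this is not the thesis's analysis — looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite-sum objective f(w) = (1/n) * sum_i (a_i^T w - b_i)^2,
# with b = A w_star so the optimum w_star interpolates every component.
n, d = 200, 10
A = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
b = A @ w_star

def sgd(index_stream, lr=0.01, epochs=30):
    """Run SGD, taking one component-gradient step per index in the stream."""
    w = np.zeros(d)
    for epoch in range(epochs):
        for i in index_stream():
            grad = 2.0 * (A[i] @ w - b[i]) * A[i]  # gradient of component i
            w -= lr * grad
    return w

# Uniform sampling: draw n independent indices with replacement per epoch.
uniform = lambda: rng.integers(0, n, size=n)
# Random reshuffling: a fresh permutation of all n indices each epoch,
# so every component is visited exactly once per pass.
shuffled = lambda: rng.permutation(n)

err_uniform = np.linalg.norm(sgd(uniform) - w_star)
err_shuffled = np.linalg.norm(sgd(shuffled) - w_star)
print(err_uniform, err_shuffled)
```

Both variants converge on this noiseless problem; the point of the thesis's last part is that the shuffling scheme, which is what practical implementations actually run, admits provably faster tight rates than the with-replacement scheme that most analyses assume.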