Summary: | <p>We investigate the asymptotic properties of deep residual networks as the number of layers increases. We first show the existence of scaling regimes for trained weights markedly different from those implicitly assumed in the neural ODE literature. We study the convergence of the hidden state dynamics in these scaling regimes, showing that one may obtain an ODE, a stochastic differential equation (SDE) or neither. Furthermore, we derive the corresponding scaling limits for the backpropagation dynamics. Finally, we prove that in the case of a smooth activation function, the scaling regime arises as a consequence of using gradient descent. In particular, we prove linear convergence of gradient descent to a global minimum for the training of deep residual networks. We also show that if the trained weights, as a function of the layer index, admit a scaling limit as the depth increases, then the limit has finite 2-variation.</p>
<p>This work also investigates the mean-field limit of path-homogeneous neural architectures. We prove convergence of the Wasserstein gradient flow to a global minimum, and we derive a generalization bound based on the stability of the optimization algorithm for 2-layer neural networks with ReLU activation.</p>
|