Loss landscape: SGD can have a better view than GD

Consider a loss function $L = \sum_{i=1}^{n} \ell_i^2$ with $\ell_i = f(x_i) - y_i$, where $f(x)$ is a deep feedforward network with $R$ layers, no bias terms, and scalar output. Assume the network is overparametrized, that is, $d \gg n$, where $d$ is the number of parameters and $n$ is the number of data points. The networks are...

Bibliographic Details
Main Authors: Poggio, Tomaso; Cooper, Yaim
Format: Technical Report
Published: Center for Brains, Minds and Machines (CBMM) 2020
Online Access: https://hdl.handle.net/1721.1/126041