Loss landscape: SGD can have a better view than GD
Consider a loss function $L = \sum_{i=1}^{n} \ell_i^2$ with $\ell_i = f(x_i) - y_i$, where $f(x)$ is a deep feedforward network with $R$ layers, no bias terms, and scalar output. Assume the network is overparametrized, that is, $d \gg n$, where $d$ is the number of parameters and $n$ is the number of data points. The networks are...
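
To make the setup concrete, here is a minimal sketch of the sum-of-squares loss for a bias-free feedforward network with scalar output. The abstract is truncated before it describes the networks further, so the ReLU activation, the layer widths, the Gaussian initialization, and all function names below are illustrative assumptions, not details taken from the report.

```python
import random


def relu(v):
    # Assumed activation; the abstract does not specify one.
    return [max(0.0, a) for a in v]


def matvec(W, v):
    # Plain matrix-vector product; W is a list of rows.
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def init_network(widths, seed=0):
    # widths = [input_dim, h_1, ..., h_{R-1}, 1] gives R weight matrices, no biases.
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
            for n_in, n_out in zip(widths[:-1], widths[1:])]


def f(weights, x):
    # Forward pass: activation after every layer except the last; scalar output.
    h = list(x)
    for W in weights[:-1]:
        h = relu(matvec(W, h))
    return matvec(weights[-1], h)[0]


def loss(weights, xs, ys):
    # L = sum_i l_i^2 with l_i = f(x_i) - y_i.
    return sum((f(weights, x) - y) ** 2 for x, y in zip(xs, ys))


def num_params(weights):
    # d: total number of weights across all layers.
    return sum(len(W) * len(W[0]) for W in weights)


# Hypothetical sizes chosen so that d is much larger than n.
widths = [3, 16, 16, 1]  # R = 3 layers
weights = init_network(widths)
data_rng = random.Random(1)
xs = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)]
ys = [data_rng.gauss(0.0, 1.0) for _ in range(5)]

d, n = num_params(weights), len(xs)
print(f"d = {d}, n = {n}, L = {loss(weights, xs, ys):.4f}")
```

With these assumed widths the network has d = 320 weights against n = 5 data points, which is the overparametrized regime the abstract refers to.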
Main Authors: | , |
---|---|
Format: | Technical Report |
Published: | Center for Brains, Minds and Machines (CBMM), 2020 |
Online Access: | https://hdl.handle.net/1721.1/126041 |