Is SGD a Bayesian sampler? Well, almost
Main Authors: | Mingard, C; Valle-Perez, G; Skalse, J; Louis, AA |
---|---|
Format: | Journal article |
Language: | English |
Published: | Journal of Machine Learning Research, 2021 |
author | Mingard, C; Valle-Perez, G; Skalse, J; Louis, AA |
collection | OXFORD |
description | Deep neural networks (DNNs) generalise remarkably well in the overparameterised regime, suggesting a strong inductive bias towards functions with low generalisation error. We empirically investigate this bias by calculating, for a range of architectures and datasets, the probability $P_{SGD}(f \mid S)$ that an overparameterised DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function $f$ consistent with a training set $S$. We also use Gaussian processes to estimate the Bayesian posterior probability $P_B(f \mid S)$ that the DNN expresses $f$ upon random sampling of its parameters, conditioned on $S$. Our main findings are that $P_{SGD}(f \mid S)$ correlates remarkably well with $P_B(f \mid S)$, and that $P_B(f \mid S)$ is strongly biased towards low-error and low-complexity functions. These results imply that the strong inductive bias in the parameter-function map (which determines $P_B(f \mid S)$), rather than any special property of SGD, is the primary explanation for why DNNs generalise so well in the overparameterised regime. While our results suggest that the Bayesian posterior $P_B(f \mid S)$ is the first-order determinant of $P_{SGD}(f \mid S)$, there remain second-order differences that are sensitive to hyperparameter tuning. A function-probability picture, based on $P_{SGD}(f \mid S)$ and/or $P_B(f \mid S)$, can shed light on the way that variations in architecture or in hyperparameter settings such as batch size, learning rate, and optimiser choice affect DNN performance. |
format | Journal article |
id | oxford-uuid:632426cc-9a46-4a06-af4c-7b2d392bce12 |
institution | University of Oxford |
language | English |
publishDate | 2021 |
publisher | Journal of Machine Learning Research |
record_format | dspace |
title | Is SGD a Bayesian sampler? Well, almost |
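The abstract above turns on two quantities: $P_{SGD}(f \mid S)$, the probability that an SGD-trained DNN converges on a function $f$ consistent with the training set $S$, and the Bayesian posterior $P_B(f \mid S)$, the probability that the DNN expresses $f$ when its parameters are sampled at random, conditioned on fitting $S$. The sketch below illustrates how both could be estimated on a toy binary classification problem, where a "function" is identified by its label pattern on a fixed test set. The architecture, dataset, run counts, and the brute-force rejection sampler for $P_B$ are illustrative assumptions only, not the paper's setup; in particular, the paper estimates $P_B(f \mid S)$ with Gaussian-process methods, which this sketch does not reproduce.

```python
# Minimal sketch (not the authors' code) of the two function-probability
# estimates compared in the paper, on a toy problem.
import torch
import torch.nn as nn
from collections import Counter

torch.manual_seed(0)

# Toy data: a "function" f is identified by its binary label pattern
# on a fixed test set X_test. All sizes here are illustrative.
X_train = torch.randn(10, 5)
y_train = (X_train.sum(dim=1) > 0).float()
X_test = torch.randn(5, 5)

def make_net():
    # Small overparameterised MLP (an illustrative choice, not the paper's).
    return nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 1))

def f_of(net, X):
    # The function a net expresses: its label pattern on X.
    with torch.no_grad():
        return tuple((net(X).squeeze(-1) > 0).int().tolist())

target = tuple(y_train.int().tolist())

def consistent_with_S(net):
    # A net is "consistent with S" if it fits every training label.
    return f_of(net, X_train) == target

# P_SGD(f|S): frequency with which independently initialised SGD runs
# converge on each function f consistent with S.
p_sgd = Counter()
for run in range(100):
    net = make_net()
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(500):
        opt.zero_grad()
        loss_fn(net(X_train).squeeze(-1), y_train).backward()
        opt.step()
    if consistent_with_S(net):  # keep only runs reaching zero training error
        p_sgd[f_of(net, X_test)] += 1

# P_B(f|S): rejection sampling from the prior over parameters, conditioned
# on fitting S. Consistent draws may be rare, and this is feasible only for
# tiny problems; hence the Gaussian-process estimates used in the paper.
p_bayes = Counter()
for _ in range(50_000):
    net = make_net()  # fresh random parameters, i.e. a draw from the prior
    if consistent_with_S(net):
        p_bayes[f_of(net, X_test)] += 1

for name, counts in [("P_SGD", p_sgd), ("P_B", p_bayes)]:
    total = sum(counts.values())
    for f, c in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(name, f, round(c / total, 3))
```

Comparing the two printed distributions function by function is, in miniature, the paper's headline experiment: if the rankings and magnitudes agree, SGD is behaving (to first order) like a sampler from the Bayesian posterior defined by random parameter sampling.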