Training neural networks for and by interpolation
In modern supervised learning, many deep neural networks are able to interpolate the data: the empirical loss can be driven to near zero on all samples simultaneously. In this work, we explicitly exploit this interpolation property for the design of a new optimization algorithm for deep learning, which we term Adaptive Learning-rates for Interpolation with Gradients (ALI-G).
Main creators: | Berrada, L; Zisserman, A; Kumar, MP |
---|---|
Format: | Conference item |
Language: | English |
Published / Created: | Journal of Machine Learning Research, 2020 |
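The abstract (reproduced in full in the `description` field below) explains that, at each iteration, ALI-G computes an adaptive learning rate in closed form from the current mini-batch loss and clips it at a fixed maximal value that requires no decay schedule. The following is a minimal PyTorch-style sketch of such a clipped, closed-form step; it assumes the interpolation setting where the minimal loss is approximately zero, and the names `ali_g_step`, `max_lr`, and `eps` are illustrative rather than the authors' released API.

```python
import torch

def ali_g_step(params, loss, max_lr=0.1, eps=1e-5):
    """One clipped, closed-form update (a sketch, not the official implementation)."""
    # Gradients of the current mini-batch loss w.r.t. every parameter.
    grads = torch.autograd.grad(loss, params)
    # Squared Euclidean norm of the full gradient.
    grad_sq_norm = sum(g.pow(2).sum() for g in grads)
    # Adaptive learning rate in closed form: loss / ||grad||^2
    # (assumes the minimal loss is ~0 under interpolation), clipped at max_lr.
    step_size = torch.clamp(loss.detach() / (grad_sq_norm + eps), max=max_lr)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(step_size * g)

# Hypothetical usage, once per mini-batch, in place of an SGD step:
#   loss = criterion(model(x), y)
#   ali_g_step(list(model.parameters()), loss, max_lr=0.1)
```

Under this sketch, only the maximal learning rate would need tuning: the effective step size shrinks automatically as the mini-batch loss approaches zero, which is the property the abstract attributes to ALI-G.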
_version_ | 1826263525345984512 |
---|---|
author | Berrada, L; Zisserman, A; Kumar, MP |
author_facet | Berrada, L; Zisserman, A; Kumar, MP |
author_sort | Berrada, L |
collection | OXFORD |
description | In modern supervised learning, many deep neural networks are able to interpolate the data: the empirical loss can be driven to near zero on all samples simultaneously. In this work, we explicitly exploit this interpolation property for the design of a new optimization algorithm for deep learning, which we term Adaptive Learning-rates for Interpolation with Gradients (ALI-G). ALI-G retains the two main advantages of Stochastic Gradient Descent (SGD), which are (i) a low computational cost per iteration and (ii) good generalization performance in practice. At each iteration, ALI-G exploits the interpolation property to compute an adaptive learning-rate in closed form. In addition, ALI-G clips the learning-rate to a maximal value, which we prove to be helpful for non-convex problems. Crucially, in contrast to the learning-rate of SGD, the maximal learning-rate of ALI-G does not require a decay schedule. This makes ALI-G considerably easier to tune than SGD. We prove the convergence of ALI-G in various stochastic settings. Notably, we tackle the realistic case where the interpolation property is satisfied up to some tolerance. We also provide experiments on a variety of deep learning architectures and tasks: (i) learning a differentiable neural computer; (ii) training a wide residual network on the SVHN data set; (iii) training a Bi-LSTM on the SNLI data set; and (iv) training wide residual networks and densely connected networks on the CIFAR data sets. ALI-G produces state-of-the-art results among adaptive methods, and even yields comparable performance with SGD, which requires manually tuned learning-rate schedules. Furthermore, ALI-G is simple to implement in any standard deep learning framework and can be used as a drop-in replacement in existing code. |
first_indexed | 2024-03-06T19:53:09Z |
format | Conference item |
id | oxford-uuid:24a6d04e-85c9-4e47-beb8-c1c59daae1b8 |
institution | University of Oxford |
language | English |
last_indexed | 2024-03-06T19:53:09Z |
publishDate | 2020 |
publisher | Journal of Machine Learning Research |
record_format | dspace |
spelling | oxford-uuid:24a6d04e-85c9-4e47-beb8-c1c59daae1b8; 2022-03-26T11:51:15Z; Training neural networks for and by interpolation; Conference item (http://purl.org/coar/resource_type/c_5794); English; Symplectic Elements; Journal of Machine Learning Research; 2020; Berrada, L; Zisserman, A; Kumar, MP; abstract as given in the description field above |
spellingShingle | Berrada, L; Zisserman, A; Kumar, MP; Training neural networks for and by interpolation |
title | Training neural networks for and by interpolation |