Deep Frank-Wolfe for neural network optimization

Learning a deep neural network requires solving a challenging optimization problem: it is a high-dimensional, non-convex and non-smooth minimization problem with a large number of terms. The current practice in neural network optimization is to rely on the stochastic gradient descent (SGD) algorithm...

Full description

Learning a deep neural network requires solving a challenging optimization problem: it is a high-dimensional, non-convex and non-smooth minimization problem with a large number of terms. The current practice in neural network optimization is to rely on the stochastic gradient descent (SGD) algorithm or its adaptive variants. However, SGD requires a hand-designed schedule for the learning rate. In addition, its adaptive variants tend to produce solutions that generalize less well on unseen data than SGD with a hand-designed schedule. We present an optimization method that offers empirically the best of both worlds: our algorithm yields good generalization performance while requiring only one hyper-parameter. Our approach is based on a composite proximal framework, which exploits the compositional nature of deep neural networks and can leverage powerful convex optimization algorithms by design. Specifically, we employ the Frank-Wolfe (FW) algorithm for SVM, which computes an optimal step-size in closed-form at each time-step. We further show that the descent direction is given by a simple backward pass in the network, yielding the same computational cost per iteration as SGD. We present experiments on the CIFAR and SNLI data sets, where we demonstrate the significant superiority of our method over Adam, Adagrad, as well as the recently proposed BPGrad and AMSGrad. Furthermore, we compare our algorithm to SGD with a hand-designed learning rate schedule, and show that it provides similar generalization while converging faster.
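The abstract's key ingredient is the Frank-Wolfe algorithm applied to an SVM problem, whose exact line search has a closed-form solution. As a rough illustration of that property only (not the paper's actual Deep Frank-Wolfe update, which works on a proximal subproblem inside the network), here is a minimal sketch of vanilla Frank-Wolfe on a bias-free, box-constrained linear SVM dual; the function name and parameters (`frank_wolfe_svm_dual`, `C`, `n_iters`, `tol`) are illustrative assumptions.

```python
import numpy as np

def frank_wolfe_svm_dual(X, y, C=1.0, n_iters=200, tol=1e-8):
    """Illustrative sketch (not the paper's DFW update): vanilla Frank-Wolfe
    on the bias-free SVM dual
        min_alpha  0.5 * alpha^T Q alpha - 1^T alpha,   0 <= alpha_i <= C,
    with Q_ij = y_i y_j <x_i, x_j>. For a quadratic objective, the exact
    line search along the FW direction has a closed-form solution."""
    n = X.shape[0]
    Z = y[:, None] * X                    # rows are y_i * x_i
    Q = Z @ Z.T                           # Q_ij = y_i y_j <x_i, x_j>
    alpha = np.zeros(n)
    for _ in range(n_iters):
        grad = Q @ alpha - 1.0            # gradient of the dual objective
        s = np.where(grad < 0.0, C, 0.0)  # linear minimization oracle over the box
        d = s - alpha                     # Frank-Wolfe direction
        gap = -grad @ d                   # FW duality gap, used as a stopping test
        if gap <= tol:
            break
        curvature = d @ Q @ d
        # Closed-form optimal step size for a quadratic objective, clipped to [0, 1].
        gamma = 1.0 if curvature <= 0.0 else min(1.0, gap / curvature)
        alpha += gamma * d
    w = Z.T @ alpha                       # recover primal weights from the dual
    return w, alpha
```

The closed-form `gamma` is the property the abstract highlights: the step size of the inner convex problem is computed analytically rather than hand-tuned, which is what lets the method run with a single hyper-parameter.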

Bibliographic Details
Main Authors: Berrada, L; Zisserman, A; Kumar, MP
Format: Internet publication
Language: English
Published: arXiv, 2018
Institution: University of Oxford