A Zeroth-Order Adaptive Learning Rate Method to Reduce Cost of Hyperparameter Tuning for Deep Learning

Bibliographic Details
Main Authors: Yanan Li, Xuebin Ren, Fangyuan Zhao, Shusen Yang
Format: Article
Language: English
Published: MDPI AG, 2021-10-01
Series: Applied Sciences
Subjects: deep learning; adaptive learning rate; robustness; stochastic gradient descent
Online Access: https://www.mdpi.com/2076-3417/11/21/10184
collection DOAJ
description Due to its powerful data representation ability, deep learning has dramatically improved the state of the art in many practical applications. However, its utility depends heavily on the fine-tuning of hyper-parameters, including the learning rate, batch size, and network initialization. Although many first-order adaptive methods (e.g., Adam, Adagrad) have been proposed to adjust the learning rate based on gradients, they remain sensitive to the initial learning rate and the network architecture. Therefore, the main challenge of using deep learning in practice is how to reduce the cost of tuning hyper-parameters. To address this, we propose a heuristic zeroth-order learning rate method, Adacomp, which adaptively adjusts the learning rate based only on values of the loss function. The main idea is that Adacomp penalizes large learning rates to ensure convergence and compensates small learning rates to accelerate training. Therefore, Adacomp is robust to the initial learning rate. Extensive experiments were conducted, comparing Adacomp with six typical adaptive methods (Momentum, Adagrad, RMSprop, Adadelta, Adam, and Adamax) on several benchmark image classification datasets (MNIST, KMNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100). The results show that Adacomp is robust not only to the initial learning rate but also to the network architecture, network initialization, and batch size.
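The record does not include the paper's algorithmic details, but the idea the abstract describes, adjusting the learning rate from loss values alone by penalizing rates that make the loss rise and compensating rates when the loss falls, can be illustrated with a minimal Python sketch. The class name LossBasedLRAdjuster, its parameters, and the multiplicative update rule below are illustrative assumptions, not the authors' actual Adacomp algorithm.

```python
class LossBasedLRAdjuster:
    """Toy zeroth-order learning-rate controller.

    It looks only at successive loss values (no gradient information
    about the learning rate itself): if the loss went up, the step size
    was presumably too large and is shrunk; if the loss went down, the
    step size is cautiously grown. This mirrors the penalize/compensate
    idea from the abstract, not the exact Adacomp update rule.
    """

    def __init__(self, lr=0.1, shrink=0.5, grow=1.1, min_lr=1e-8, max_lr=10.0):
        self.lr = lr
        self.shrink = shrink    # applied when the loss increases (penalize)
        self.grow = grow        # applied when the loss decreases (compensate)
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.prev_loss = None

    def step(self, loss):
        """Return the learning rate to use, given the latest loss value."""
        if self.prev_loss is not None:
            if loss > self.prev_loss:
                self.lr *= self.shrink
            else:
                self.lr *= self.grow
            self.lr = min(max(self.lr, self.min_lr), self.max_lr)
        self.prev_loss = loss
        return self.lr


# Usage: gradient descent on f(w) = (w - 3)^2 with a deliberately
# oversized initial learning rate; the controller recovers from it.
w = 0.0
adjuster = LossBasedLRAdjuster(lr=2.5)
for _ in range(50):
    loss = (w - 3.0) ** 2
    lr = adjuster.step(loss)
    w -= lr * 2.0 * (w - 3.0)   # gradient of f is 2 * (w - 3)
print(f"w = {w:.3f}")           # converges toward the minimizer w = 3
```

Because the controller needs only loss evaluations, the same sketch applies unchanged to mini-batch training loops, which is what makes such zeroth-order schemes insensitive to the initial learning rate.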
first_indexed 2024-03-10T06:07:07Z
format Article
id doaj.art-fea104c85f094c74b56e338e92eeb8ae
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-10T06:07:07Z
publishDate 2021-10-01
publisher MDPI AG
record_format Article
series Applied Sciences
doi 10.3390/app112110184
volume 11
issue 21
article_number 10184
affiliations Yanan Li (School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China); Xuebin Ren (National Engineering Laboratory for Big Data Analytics, Xi’an Jiaotong University, Xi’an 710049, China); Fangyuan Zhao (National Engineering Laboratory for Big Data Analytics, Xi’an Jiaotong University, Xi’an 710049, China); Shusen Yang (School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China)
title A Zeroth-Order Adaptive Learning Rate Method to Reduce Cost of Hyperparameter Tuning for Deep Learning
topic deep learning
adaptive learning rate
robustness
stochastic gradient descent
url https://www.mdpi.com/2076-3417/11/21/10184