Escaping Saddle Points with Adaptive Gradient Methods

© 2019 International Machine Learning Society (IMLS). Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points.
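
To make the abstract's "preconditioned SGD" viewpoint concrete, here is a minimal sketch (not taken from the paper, and not the authors' analysis): RMSProp written explicitly as SGD with a diagonal preconditioner estimated online from squared gradients. The function name and the toy gradient oracle are hypothetical illustrations.

import numpy as np

def rmsprop_as_preconditioned_sgd(grad_fn, x0, lr=1e-3, beta=0.99, eps=1e-8, steps=5000):
    # grad_fn(x) returns a noisy gradient sample at x (any stochastic oracle).
    # v is an online, coordinate-wise estimate of the gradient second moment;
    # 1/sqrt(v + eps) acts as a diagonal preconditioner, rescaling the gradient
    # noise toward isotropy near stationary points (the mechanism the abstract
    # credits for escaping saddle points).
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        v = beta * v + (1.0 - beta) * g ** 2      # online second-moment estimate
        precond = 1.0 / np.sqrt(v + eps)          # diagonal preconditioner
        x = x - lr * precond * g                  # preconditioned SGD step
    return x

# Toy usage: a strict saddle f(x) = x[0]**2 - x[1]**2, gradient observed with noise.
rng = np.random.default_rng(0)
noisy_grad = lambda x: np.array([2.0 * x[0], -2.0 * x[1]]) + 0.1 * rng.standard_normal(2)
print(rmsprop_as_preconditioned_sgd(noisy_grad, x0=[1.0, 1e-6]))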

Bibliographic Details
Main Authors: Staib, Matthew; Reddi, Sashank; Kale, Satyen; Kumar, Sanjiv; Sra, Suvrit
Other Authors: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Format: Article
Language: English
Published: 2021
Online Access: https://hdl.handle.net/1721.1/137532
Citation: Staib, Matthew, Reddi, Sashank, Kale, Satyen, Kumar, Sanjiv and Sra, Suvrit. 2019. "Escaping Saddle Points with Adaptive Gradient Methods." 36th International Conference on Machine Learning, ICML 2019, 2019-June.
Conference: 36th International Conference on Machine Learning, ICML 2019
Published in: Proceedings of Machine Learning Research
Publisher Version: http://proceedings.mlr.press/v97/staib19a.html
Terms of Use: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.