Natural gradient algorithms for training deep neural networks
Main Author: | Puiu, CO |
---|---|
Other Authors: | Cartis, C |
Format: | Thesis |
Language: | English |
Published: | 2023 |
Subjects: | Practical Implementation of Natural Gradient for DNNs; Optimization for Deep Neural Networks |
_version_ | 1826313546527408128 |
---|---|
author | Puiu, CO |
author2 | Cartis, C |
author_facet | Cartis, C Puiu, CO |
author_sort | Puiu, CO |
collection | OXFORD |
description | <p>In this thesis, we reduce the computational cost of a particular class of learning algorithms for Deep Neural Networks (DNNs; artificial intelligence models loosely inspired by the human brain). Natural Gradient Descent (NGD) algorithms have favorable properties compared to stochastic gradient methods. K-FAC (Martens and Grosse, 2015) is a feasible implementation of NGD for DNNs which retains some desirable properties of NGD. Still, K-FAC has cubic time-complexity in layer size, owing to the inversion of the Kronecker factors (K-factors). This results in slow run-times for large K-factors. The thesis addresses the time-complexity issue in two steps.</p>
<p>First, we propose low-rank inverse representations of K-factors, obtainable in quadratic time with randomized Numerical Linear Algebra (NLA). Our theory shows that, at least for sufficiently large layers, we can also control the K-factor representation error, owing to the exponentially averaged (EA) construction paradigm. Numerical results show that the theory is pessimistic: in practical settings the error can be controlled even better. By improving the inverse-application subroutine to scale quadratically as well, we obtain two quadratic time-complexity K-FAC variants: R-KFAC and SRE-KFAC.</p>
<p>Second, by using Brand's algorithm for online NLA to exploit the EA construction paradigm of K-factors, we maintain online low-rank inverse representations of K-factors, which takes only linear time. The approach is limited to sufficiently large layers, a condition typically satisfied in practice by fully connected layers. By further improving the inverse-application subroutine to scale linearly as well in this case, we obtain a linear time-complexity K-FAC variant: B-KFAC. B-KFAC also has linear memory-complexity in layer size, unlike K-FAC's quadratic one. We further propose to algorithmically control the B-KFAC representation error, at the price of quadratic time costs on some iterations (most iterations still take linear time). This yields three further algorithms.</p>
<p>Our improved-complexity K-FAC variants speed up K-FAC 3–4 times for most investigated configurations, in both single-GPU and distributed settings.</p> |
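The cubic bottleneck described in the abstract comes from the two K-factor inversions in the K-FAC update. The following is a minimal NumPy sketch of that update for a single fully connected layer, not the thesis's implementation; all sizes, names, and the damping value are illustrative.

```python
import numpy as np

# Hypothetical sizes for one fully connected layer.
n_in, n_out, batch, damping = 128, 64, 256, 1e-3

rng = np.random.default_rng(0)
# K-factors: A built from layer inputs (activations), G from output gradients.
a = rng.standard_normal((batch, n_in))       # batch of activations
g = rng.standard_normal((batch, n_out))      # batch of back-propagated gradients
A = a.T @ a / batch                          # input-side K-factor, n_in x n_in
G = g.T @ g / batch                          # output-side K-factor, n_out x n_out

grad_W = rng.standard_normal((n_out, n_in))  # loss gradient w.r.t. the weights

# K-FAC step: (G + lam*I)^{-1} grad_W (A + lam*I)^{-1}.
# The two dense solves cost O(n_out^3) and O(n_in^3) respectively --
# the cubic-in-layer-size cost the thesis targets.
step = np.linalg.solve(G + damping * np.eye(n_out),
                       np.linalg.solve(A + damping * np.eye(n_in), grad_W.T).T)
```

Because `A` and `G` are symmetric, the inner solve against `grad_W.T` followed by a transpose applies the input-side inverse from the right.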
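The quadratic-time idea in the second paragraph can be illustrated with a generic randomized range-finder: sketch the K-factor, eigendecompose a small projected core, and then apply the damped inverse through the low-rank representation instead of a dense inversion. This is a hedged sketch of the general technique, not the R-KFAC/SRE-KFAC algorithms themselves; the synthetic factor and rank are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, damping = 512, 20, 1e-3

# Stand-in K-factor with fast spectral decay (low rank is viable precisely
# when a few directions dominate, as the EA construction tends to produce).
M = rng.standard_normal((n, n))
A = M @ np.diag(0.5 ** np.arange(n)) @ M.T / n
A = (A + A.T) / 2                            # symmetrize against roundoff

# Randomized range finder: O(n^2 r) work instead of an O(n^3) inversion.
omega = rng.standard_normal((n, r))
Q, _ = np.linalg.qr(A @ omega)               # orthonormal basis for top range
core = Q.T @ A @ Q                           # small r x r projected factor
evals, V = np.linalg.eigh(core)
U = Q @ V                                    # A ~= U diag(evals) U^T

# Apply (U D U^T + lam*I)^{-1} x without forming any n x n inverse:
# (U D U^T + lam*I)^{-1} = lam^{-1} I + U ((D + lam)^{-1} - lam^{-1}) U^T.
x = rng.standard_normal(n)
coeff = 1.0 / (evals + damping) - 1.0 / damping
inv_x = x / damping + U @ (coeff * (U.T @ x))
```

Once the representation is built, each inverse application costs only O(n r), which is what makes improving the inverse-application subroutine (as the thesis does) worthwhile.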
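The linear-time idea in the third paragraph rests on the fact that each EA step changes the K-factor by a rank-one term, so a low-rank eigen-representation can be updated incrementally (Brand-style) rather than recomputed. The sketch below shows one such update in O(n r + r^3) time; it is a generic illustration under assumed sizes, not B-KFAC itself, and it assumes the new vector has a nonzero component outside the current subspace.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, rho = 1000, 10, 0.95

# Low-rank EA state: A ~= U diag(d) U^T, with U having orthonormal columns.
U = np.linalg.qr(rng.standard_normal((n, r)))[0]
d = np.sort(rng.random(r))[::-1]

def ea_rank_one_update(U, d, a, rho):
    """One EA step A <- rho*A + (1-rho)*a a^T, truncated back to rank r.

    Only O(n r) work on n-sized objects plus an O(r^3) eigendecomposition
    of a small core -- linear in layer size, unlike a fresh factorization."""
    r = d.shape[0]
    c = U.T @ a                           # coefficients of a in the basis
    res = a - U @ c                       # component orthogonal to span(U)
    res_norm = np.linalg.norm(res)        # assumed nonzero in this sketch
    q = res / res_norm                    # new orthonormal direction
    # Exact update expressed in the extended basis [U, q]:
    z = np.concatenate([c, [res_norm]])
    core = rho * np.diag(np.concatenate([d, [0.0]])) \
         + (1.0 - rho) * np.outer(z, z)
    evals, V = np.linalg.eigh(core)
    keep = np.argsort(evals)[::-1][:r]    # truncate back to rank r
    U_new = np.column_stack([U, q]) @ V[:, keep]
    return U_new, evals[keep]

a = rng.standard_normal(n)
U, d = ea_rank_one_update(U, d, a, rho)
```

Dropping the smallest eigenvalue at each step is where the representation error enters, which is why the thesis additionally proposes algorithmic control of that error at the cost of occasional quadratic-time iterations.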
first_indexed | 2024-09-25T04:16:27Z |
format | Thesis |
id | oxford-uuid:0b7ef53b-2192-4332-8641-3b53a7870a98 |
institution | University of Oxford |
language | English |
last_indexed | 2024-09-25T04:16:27Z |
publishDate | 2023 |
record_format | dspace |
spelling | oxford-uuid:0b7ef53b-2192-4332-8641-3b53a7870a982024-07-23T15:13:18ZNatural gradient algorithms for training deep neural networksThesishttp://purl.org/coar/resource_type/c_db06uuid:0b7ef53b-2192-4332-8641-3b53a7870a98Practical Implementation of Natural Gradient for DNNsOptimization for Deep Neural NetworksEnglishHyrax Deposit2023Puiu, COCartis, CBai, SHauser, RMartens, J |
spellingShingle | Practical Implementation of Natural Gradient for DNNs Optimization for Deep Neural Networks Puiu, CO Natural gradient algorithms for training deep neural networks |
title | Natural gradient algorithms for training deep neural networks |
title_full | Natural gradient algorithms for training deep neural networks |
title_fullStr | Natural gradient algorithms for training deep neural networks |
title_full_unstemmed | Natural gradient algorithms for training deep neural networks |
title_short | Natural gradient algorithms for training deep neural networks |
title_sort | natural gradient algorithms for training deep neural networks |
topic | Practical Implementation of Natural Gradient for DNNs Optimization for Deep Neural Networks |
work_keys_str_mv | AT puiuco naturalgradientalgorithmsfortrainingdeepneuralnetworks |