Natural gradient algorithms for training deep neural networks


Bibliographic Details
Main Author: Puiu, CO
Other Authors: Cartis, C
Format: Thesis
Language: English
Published: 2023
Subjects: Practical Implementation of Natural Gradient for DNNs; Optimization for Deep Neural Networks
_version_ 1826313546527408128
author Puiu, CO
author2 Cartis, C
author_facet Cartis, C
Puiu, CO
author_sort Puiu, CO
collection OXFORD
description <p>In this thesis, we reduce the computational cost of a particular class of learning algorithms for Deep Neural Networks (DNNs; artificial intelligence models inspired by the human brain). Natural Gradient Descent (NGD) algorithms have favorable properties compared with Stochastic Gradient methods. K-FAC (Martens and Grosse, 2015) is a feasible implementation of NGD for DNNs that retains some of NGD's desirable properties. However, K-FAC has cubic time-complexity in layer size, owing to the computation of Kronecker-factor (K-factor) inverses, which results in slow run-times for large K-factors. The thesis addresses this time-complexity issue in two steps.</p> <p>First, we propose low-rank inverse representations of the K-factors, obtainable in quadratic time with randomized Numerical Linear Algebra (NLA). Our theory shows that, at least for sufficiently large layers, we can also control the K-factor representation error, owing to the exponentially averaged (EA) construction paradigm. Numerical results show that the theory is pessimistic: in practical settings the error can be controlled even more tightly. By improving the inverse-application subroutine to scale quadratically as well, we obtain two quadratic time-complexity K-FAC variants: R-KFAC and SRE-KFAC.</p> <p>Secondly, using Brand's algorithm for online NLA to exploit the EA construction of the K-factors, we maintain online low-rank inverse representations of the K-factors, which takes only linear time. The approach is limited to sufficiently large layers, a condition typically satisfied in practice by fully connected layers. By further improving the inverse-application subroutine to scale linearly as well in this case, we obtain a linear time-complexity K-FAC variant: B-KFAC. B-KFAC also has linear memory-complexity in layer size, in contrast to K-FAC's quadratic memory-complexity. We further propose algorithmic control of the B-KFAC representation error, at the price of quadratic time costs on some iterations (most iterations still take linear time); this yields three further algorithms.</p> <p>Our improved-complexity K-FAC variants speed up K-FAC by a factor of 3–4 in most investigated settings, both single-GPU and distributed.</p>
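For readers unfamiliar with the ideas summarised in the abstract, the following NumPy sketch illustrates why the classical K-FAC update is cubic in the layer size and how a randomized low-rank representation of a K-factor brings the per-layer cost down to quadratic. This is an illustrative sketch under our own assumptions (function names, oversampling constant, damping convention); it is not the thesis's R-KFAC/SRE-KFAC implementation.

```python
# Hypothetical NumPy sketch: one K-FAC layer update, plus a randomized
# low-rank way of applying a damped K-factor inverse in O(d^2 r) time.
import numpy as np

def kfac_layer_update(dW, A, G, damping=1e-3):
    """Classical K-FAC direction for one layer.

    With the layer's Fisher block approximated as A (Kronecker) G, the
    preconditioned gradient is G^{-1} dW A^{-1}.  The two dense inversions
    are what makes K-FAC cubic in the layer dimensions."""
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))
    return G_inv @ dW @ A_inv

def randomized_lowrank_inverse_apply(K, V, rank=20, damping=1e-3, seed=0):
    """Apply (K + damping*I)^{-1} to a (d x n) matrix V using a rank-`rank`
    randomized eigen-approximation of the symmetric PSD K-factor K.

    Cost is O(d^2 * rank) to build the representation and O(d * n * rank)
    per application, instead of O(d^3) for a dense inverse."""
    rng = np.random.default_rng(seed)
    d = K.shape[0]
    Omega = rng.standard_normal((d, rank + 10))      # oversampled test matrix
    Q, _ = np.linalg.qr(K @ Omega)                   # approximate range of K
    core = Q.T @ K @ Q                               # small projected problem
    evals, evecs = np.linalg.eigh(core)
    evals, evecs = evals[-rank:], evecs[:, -rank:]   # keep leading eigenpairs
    U = Q @ evecs                                    # approx top eigenvectors of K
    # (K + damping*I)^{-1} ~= U diag(1/(evals+damping) - 1/damping) U^T + I/damping
    coeff = 1.0 / (evals + damping) - 1.0 / damping
    return U @ (coeff[:, None] * (U.T @ V)) + V / damping
```

The second step of the thesis maintains such a representation online as the exponentially averaged K-factor receives rank-one updates. The routine below is a simplified, symmetric analogue of Brand's incremental SVD and is again only a hypothetical illustration of the kind of update behind B-KFAC, not its actual implementation.

```python
def online_lowrank_update(U, s, a, rho=0.95, rank=20):
    """Update a rank-r eigen-representation U diag(s) U^T of an exponentially
    averaged K-factor after the symmetric update  K <- rho*K + (1-rho)*a a^T.

    Only O(d*r^2 + r^3) work per update: the dense operations touch either a
    d x (r+1) basis or an (r+1) x (r+1) core, hence linear scaling in d."""
    p = U.T @ a                            # coordinates of a in the current basis
    residual = a - U @ p                   # component outside the basis
    beta = np.linalg.norm(residual)
    q = residual / beta if beta > 1e-12 else np.zeros_like(a)
    r = s.shape[0]
    # Small (r+1) x (r+1) core: rho * diag(s) padded, plus the rank-one term.
    core = np.zeros((r + 1, r + 1))
    core[:r, :r] = rho * np.diag(s)
    w = np.concatenate([p, [beta]])
    core += (1.0 - rho) * np.outer(w, w)
    evals, W = np.linalg.eigh(core)
    U_new = np.hstack([U, q[:, None]]) @ W           # O(d * r^2)
    return U_new[:, -rank:], evals[-rank:]           # truncate back to rank r
```

In both sketches the layer dimension d only ever appears in matrix-times-thin-matrix products, which is where the quadratic (offline) and linear (online) scalings described in the abstract come from.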
first_indexed 2024-09-25T04:16:27Z
format Thesis
id oxford-uuid:0b7ef53b-2192-4332-8641-3b53a7870a98
institution University of Oxford
language English
last_indexed 2024-09-25T04:16:27Z
publishDate 2023
record_format dspace
spelling oxford-uuid:0b7ef53b-2192-4332-8641-3b53a7870a98; 2024-07-23T15:13:18Z; Natural gradient algorithms for training deep neural networks; Thesis; http://purl.org/coar/resource_type/c_db06; uuid:0b7ef53b-2192-4332-8641-3b53a7870a98; Practical Implementation of Natural Gradient for DNNs; Optimization for Deep Neural Networks; English; Hyrax Deposit; 2023; Puiu, CO; Cartis, C; Bai, S; Hauser, R; Martens, J
spellingShingle Practical Implementation of Natural Gradient for DNNs
Optimization for Deep Neural Networks
Puiu, CO
Natural gradient algorithms for training deep neural networks
title Natural gradient algorithms for training deep neural networks
title_full Natural gradient algorithms for training deep neural networks
title_fullStr Natural gradient algorithms for training deep neural networks
title_full_unstemmed Natural gradient algorithms for training deep neural networks
title_short Natural gradient algorithms for training deep neural networks
title_sort natural gradient algorithms for training deep neural networks
topic Practical Implementation of Natural Gradient for DNNs
Optimization for Deep Neural Networks
work_keys_str_mv AT puiuco naturalgradientalgorithmsfortrainingdeepneuralnetworks