Natural gradient algorithms for training deep neural networks
Main Author: | Puiu, CO |
---|---|
Other Authors: | Cartis, C |
Format: | Thesis |
Language: | English |
Published: | 2023 |
Subjects: | Practical Implementation of Natural Gradient for DNNs; Optimization for Deep Neural Networks |
_version_ | 1826313546527408128 |
---|---|
author | Puiu, CO |
author2 | Cartis, C |
author_facet | Cartis, C Puiu, CO |
author_sort | Puiu, CO |
collection | OXFORD |
description | <p>In this thesis, we reduce the computational cost of a particular class of learning algorithms for Deep Neural Networks (DNNs; artificial intelligence models loosely inspired by the human brain). Natural Gradient Descent (NGD) algorithms have favorable properties compared to stochastic gradient methods. K-FAC (Martens and Grosse, 2015) is a feasible implementation of NGD for DNNs which retains some desirable properties of NGD. Still, K-FAC has cubic time-complexity in layer size, owing to the inversion of the Kronecker factors (K-factors). This results in slow run-times for large K-factors. The thesis addresses the time-complexity issue in two steps.</p>
<p>First, we propose low-rank inverse representations of K-factors, obtainable in quadratic time with randomized Numerical Linear Algebra (NLA). Our theory shows that, at least for sufficiently large layers, we can also control the K-factor representation error, owing to the exponentially averaged (EA) construction paradigm. Numerical results show that the theory is pessimistic: in practical settings the error can be controlled even better. By improving the inverse-application subroutine to scale quadratically as well, we obtain two quadratic time-complexity K-FAC variants: R-KFAC and SRE-KFAC.</p>
<p>Second, by using Brand's algorithm for online NLA to exploit the EA construction paradigm of K-factors, we maintain online low-rank inverse representations of K-factors, which takes only linear time. The approach is limited to sufficiently large layers, a condition typically satisfied in practice by fully connected layers. By further improving the inverse-application subroutine to scale linearly as well in this case, we obtain a linear time-complexity K-FAC variant: B-KFAC. B-KFAC also has linear memory-complexity in layer size, unlike K-FAC's quadratic one. We further propose to algorithmically control the B-KFAC representation error, at the price of quadratic time costs on some iterations (most iterations still take linear time). This yields three further algorithms.</p>
<p>Our improved-complexity K-FAC variants speed up K-FAC 3–4 times for most investigated configurations, in both single-GPU and distributed settings.</p> |
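The cubic bottleneck described in the abstract comes from the two K-factor inversions in the K-FAC update. The following is a minimal NumPy sketch of that update for a single fully connected layer, not the thesis's implementation; all sizes, names, and the damping value are illustrative.

```python
import numpy as np

# Hypothetical sizes for one fully connected layer.
n_in, n_out, batch, damping = 128, 64, 256, 1e-3

rng = np.random.default_rng(0)
# K-factors: A built from layer inputs (activations), G from output gradients.
a = rng.standard_normal((batch, n_in))       # batch of activations
g = rng.standard_normal((batch, n_out))      # batch of back-propagated gradients
A = a.T @ a / batch                          # input-side K-factor, n_in x n_in
G = g.T @ g / batch                          # output-side K-factor, n_out x n_out

grad_W = rng.standard_normal((n_out, n_in))  # loss gradient w.r.t. the weights

# K-FAC step: (G + lam*I)^{-1} grad_W (A + lam*I)^{-1}.
# The two dense solves cost O(n_out^3) and O(n_in^3) respectively --
# the cubic-in-layer-size cost the thesis targets.
step = np.linalg.solve(G + damping * np.eye(n_out),
                       np.linalg.solve(A + damping * np.eye(n_in), grad_W.T).T)
```

Because `A` and `G` are symmetric, the inner solve against `grad_W.T` followed by a transpose applies the input-side inverse from the right.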
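The quadratic-time idea in the second paragraph can be illustrated with a generic randomized range-finder: sketch the K-factor, eigendecompose a small projected core, and then apply the damped inverse through the low-rank representation instead of a dense inversion. This is a hedged sketch of the general technique, not the R-KFAC/SRE-KFAC algorithms themselves; the synthetic factor and rank are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, damping = 512, 20, 1e-3

# Stand-in K-factor with fast spectral decay (low rank is viable precisely
# when a few directions dominate, as the EA construction tends to produce).
M = rng.standard_normal((n, n))
A = M @ np.diag(0.5 ** np.arange(n)) @ M.T / n
A = (A + A.T) / 2                            # symmetrize against roundoff

# Randomized range finder: O(n^2 r) work instead of an O(n^3) inversion.
omega = rng.standard_normal((n, r))
Q, _ = np.linalg.qr(A @ omega)               # orthonormal basis for top range
core = Q.T @ A @ Q                           # small r x r projected factor
evals, V = np.linalg.eigh(core)
U = Q @ V                                    # A ~= U diag(evals) U^T

# Apply (U D U^T + lam*I)^{-1} x without forming any n x n inverse:
# (U D U^T + lam*I)^{-1} = lam^{-1} I + U ((D + lam)^{-1} - lam^{-1}) U^T.
x = rng.standard_normal(n)
coeff = 1.0 / (evals + damping) - 1.0 / damping
inv_x = x / damping + U @ (coeff * (U.T @ x))
```

Once the representation is built, each inverse application costs only O(n r), which is what makes improving the inverse-application subroutine (as the thesis does) worthwhile.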
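The linear-time idea in the third paragraph rests on the fact that each EA step changes the K-factor by a rank-one term, so a low-rank eigen-representation can be updated incrementally (Brand-style) rather than recomputed. The sketch below shows one such update in O(n r + r^3) time; it is a generic illustration under assumed sizes, not B-KFAC itself, and it assumes the new vector has a nonzero component outside the current subspace.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r, rho = 1000, 10, 0.95

# Low-rank EA state: A ~= U diag(d) U^T, with U having orthonormal columns.
U = np.linalg.qr(rng.standard_normal((n, r)))[0]
d = np.sort(rng.random(r))[::-1]

def ea_rank_one_update(U, d, a, rho):
    """One EA step A <- rho*A + (1-rho)*a a^T, truncated back to rank r.

    Only O(n r) work on n-sized objects plus an O(r^3) eigendecomposition
    of a small core -- linear in layer size, unlike a fresh factorization."""
    r = d.shape[0]
    c = U.T @ a                           # coefficients of a in the basis
    res = a - U @ c                       # component orthogonal to span(U)
    res_norm = np.linalg.norm(res)        # assumed nonzero in this sketch
    q = res / res_norm                    # new orthonormal direction
    # Exact update expressed in the extended basis [U, q]:
    z = np.concatenate([c, [res_norm]])
    core = rho * np.diag(np.concatenate([d, [0.0]])) \
         + (1.0 - rho) * np.outer(z, z)
    evals, V = np.linalg.eigh(core)
    keep = np.argsort(evals)[::-1][:r]    # truncate back to rank r
    U_new = np.column_stack([U, q]) @ V[:, keep]
    return U_new, evals[keep]

a = rng.standard_normal(n)
U, d = ea_rank_one_update(U, d, a, rho)
```

Dropping the smallest eigenvalue at each step is where the representation error enters, which is why the thesis additionally proposes algorithmic control of that error at the cost of occasional quadratic-time iterations.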
first_indexed | 2024-09-25T04:16:27Z |
format | Thesis |
id | oxford-uuid:0b7ef53b-2192-4332-8641-3b53a7870a98 |
institution | University of Oxford |
language | English |
last_indexed | 2024-09-25T04:16:27Z |
publishDate | 2023 |
record_format | dspace |
spelling | oxford-uuid:0b7ef53b-2192-4332-8641-3b53a7870a982024-07-23T15:13:18ZNatural gradient algorithms for training deep neural networksThesishttp://purl.org/coar/resource_type/c_db06uuid:0b7ef53b-2192-4332-8641-3b53a7870a98Practical Implementation of Natural Gradient for DNNsOptimization for Deep Neural NetworksEnglishHyrax Deposit2023Puiu, COCartis, CBai, SHauser, RMartens, J |
spellingShingle | Practical Implementation of Natural Gradient for DNNs Optimization for Deep Neural Networks Puiu, CO Natural gradient algorithms for training deep neural networks |
title | Natural gradient algorithms for training deep neural networks |
title_full | Natural gradient algorithms for training deep neural networks |
title_fullStr | Natural gradient algorithms for training deep neural networks |
title_full_unstemmed | Natural gradient algorithms for training deep neural networks |
title_short | Natural gradient algorithms for training deep neural networks |
title_sort | natural gradient algorithms for training deep neural networks |
topic | Practical Implementation of Natural Gradient for DNNs Optimization for Deep Neural Networks |
work_keys_str_mv | AT puiuco naturalgradientalgorithmsfortrainingdeepneuralnetworks |