Leveraging structure for optimization in deep learning
Main Author: | Berrada, L |
---|---|
Other Authors: | Mudigonda, P; Zisserman, A |
Format: | Thesis |
Language: | English |
Published: | 2019 |
Subjects: | optimization; deep learning; machine learning |
author | Berrada, L |
author2 | Mudigonda, P |
description | <p>In the past decade, neural networks have demonstrated impressive performance in supervised learning. They now power applications ranging from real-time medical diagnosis and wild-animal monitoring to human-sounding virtual assistants. Despite their increasing importance, however, they remain difficult to train, owing to a complex interplay between the learning objective, the optimization algorithm, and generalization performance. Indeed, using different loss functions and optimization algorithms leads to trained models with significantly different performance on unseen data.</p>
<p>In this thesis, we focus first on the loss function, for which a task-specific approach can improve generalization performance in the small-data or noisy-data setting. Specifically, we consider the top-k classification setting. We show that traditional piecewise-linear top-k loss functions require smoothing to work well with neural networks. However, the resulting smoothed loss function and its gradient are computationally challenging to evaluate: a naive approach has a runtime proportional to the number of possible sets of k predictions. Thanks to a connection to polynomial algebra, we develop computationally efficient algorithms to evaluate the smoothed loss function and its gradient. This allows us to train models with stochastic gradient descent (SGD) using the smoothed top-k loss function, and we show that doing so is more robust to overfitting than using the standard cross-entropy loss function.</p>
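The combinatorial cost of the naive approach can be made concrete with a small sketch. The snippet below implements one plausible smoothed top-k loss — a log-sum-exp relaxation of a margin-based top-k error — by enumerating every k-subset of the labels. The function names, the unit margin, and the exact formulation are illustrative assumptions rather than the thesis's definition; the point is the O(n choose k) enumeration, which the thesis's polynomial-algebra algorithms are designed to avoid.

```python
import itertools
import math

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) over a list of floats.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def smooth_topk_loss(scores, y, k, tau=1.0):
    """Naive smoothed top-k loss: enumerate all k-subsets of the n labels.

    Runtime is O(C(n, k)) -- exactly the combinatorial cost that an
    efficient (polynomial-algebra) evaluation avoids.
    """
    n = len(scores)
    all_terms = []      # over every k-subset A of labels
    correct_terms = []  # over k-subsets that contain the true label y
    for A in itertools.combinations(range(n), k):
        margin = 0.0 if y in A else 1.0  # unit penalty if y is missed
        avg = sum(scores[j] for j in A) / k
        all_terms.append((margin + avg) / tau)
        if y in A:
            correct_terms.append(avg / tau)
    return tau * (logsumexp(all_terms) - logsumexp(correct_terms))

# Toy usage: the loss is smaller when the true label scores highly.
loss_good = smooth_topk_loss([0.0, 4.0, 0.0, 0.0], y=1, k=2)
loss_bad = smooth_topk_loss([4.0, 0.0, 0.0, 0.0], y=1, k=2)
```

Smoothing enters through the temperature `tau`: as `tau` tends to zero, the log-sum-exp terms approach hard maxima and the piecewise-linear top-k loss is recovered.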
<p>Second, we turn our attention to optimization algorithms. While SGD empirically provides good generalization, it requires a manually tuned learning-rate schedule, and obtaining a suitable schedule for a given network and data set is time-consuming and computationally expensive. In this thesis, we propose novel optimization algorithms that alleviate this issue by exploiting structure in three different ways, each leading to a new optimization algorithm. First, we exploit the piecewise linearity of the activation and loss functions, which results in a difference-of-convex programming approach. Second, we use the compositionality of the model and the loss function with the help of a proximal approach. Third, we exploit the property of interpolating models to derive an adaptive learning rate for SGD. Empirically, we compare the performance of the three algorithms on various deep learning tasks, and we demonstrate their advantages over state-of-the-art methods while avoiding the need for manual learning-rate schedules.</p> |
id | oxford-uuid:79360a95-a6e0-4acc-ba3a-07598f52ea39 |
institution | University of Oxford |
spelling | Berrada, L; Mudigonda, P; Zisserman, A. Leveraging structure for optimization in deep learning. Thesis (Hyrax Deposit), University of Oxford, 2019. oxford-uuid:79360a95-a6e0-4acc-ba3a-07598f52ea39 |
title | Leveraging structure for optimization in deep learning |
topic | optimization; deep learning; machine learning |