Leveraging structure for optimization in deep learning

Bibliographic Details
Main Author: Berrada, L
Other Authors: Mudigonda, P
Format: Thesis
Language: English
Published: 2019
Subjects: optimization, deep learning, machine learning
_version_ 1797076964397285376
author Berrada, L
author2 Mudigonda, P
author_facet Mudigonda, P
Berrada, L
author_sort Berrada, L
collection OXFORD
description <p>In the past decade, neural networks have demonstrated impressive performance in supervised learning. They now power many applications, ranging from real-time medical diagnosis and wild-animal monitoring to human-sounding virtual assistants. Despite their increasing importance, however, they remain difficult to train due to a complex interplay between the learning objective, the optimization algorithm and generalization performance. Indeed, different loss functions and optimization algorithms lead to trained models with significantly different performance on unseen data.</p> <p>In this thesis, we first focus on the loss function, where a task-specific approach can improve generalization performance in small-data or noisy-data settings. Specifically, we consider the top-k classification setting. We show that traditional piecewise-linear top-k loss functions require smoothing to work well with neural networks. However, evaluating the resulting smoothed loss function and its gradient is computationally challenging: a naive approach has a runtime proportional to the number of possible combinations of k predictions. Thanks to a connection to polynomial algebra, we develop computationally efficient algorithms to evaluate the smoothed loss function and its gradient. This allows us to train models with stochastic gradient descent (SGD) using the smooth top-k loss function, which we show to be more robust to over-fitting than the standard cross-entropy loss function.</p> <p>Second, we turn our attention to optimization algorithms. While SGD empirically provides good generalization, it requires a manually tuned learning-rate schedule, and obtaining a suitable schedule for a given network and data set is a time-consuming and computationally expensive task. In this thesis, we propose novel optimization algorithms to alleviate this issue. In particular, we exploit structure in three different ways, each leading to a new optimization algorithm. First, we exploit the piecewise linearity of the activation and loss functions, which results in a difference-of-convex programming approach. Second, we use the compositionality of the model and the loss function with the help of a proximal approach. Third, we exploit the property of interpolating models to derive an adaptive learning rate for SGD. Empirically, we compare the performance of the three algorithms on various deep learning tasks, and demonstrate their advantages over state-of-the-art methods while avoiding the need for manual learning-rate schedules.</p>
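Two of the computational ideas mentioned in the description can be illustrated concretely. First, the smoothed top-k loss involves a sum over all k-element subsets of (exponentiated) class scores; written naively this costs C(n, k) terms, but the sum equals the k-th elementary symmetric polynomial of the exponentiated scores, which a standard dynamic-programming recurrence evaluates in O(nk) time. The Python sketch below illustrates only this reduction; it is not the thesis's algorithm, which additionally addresses numerical stability and gradient computation, and all names in it are hypothetical.

    import math
    from itertools import combinations

    def elementary_symmetric(values, k):
        # k-th elementary symmetric polynomial e_k(values), computed with the
        # standard O(n*k) dynamic-programming recurrence instead of enumerating
        # all k-element subsets.
        e = [1.0] + [0.0] * k
        for v in values:
            for j in range(k, 0, -1):  # high to low so each v enters a subset at most once
                e[j] += e[j - 1] * v
        return e[k]

    def sum_over_k_subsets(scores, k, tau=1.0):
        # sum over all k-element subsets A of exp((1/tau) * sum_{j in A} s_j),
        # obtained as e_k of the exponentiated, temperature-scaled scores
        return elementary_symmetric([math.exp(s / tau) for s in scores], k)

    # sanity check against the naive enumeration over all C(n, k) subsets
    scores, k, tau = [0.3, -1.2, 2.0, 0.5, 1.1], 2, 0.5
    naive = sum(math.exp(sum(scores[j] for j in subset) / tau)
                for subset in combinations(range(len(scores)), k))
    assert math.isclose(naive, sum_over_k_subsets(scores, k, tau), rel_tol=1e-9)

Second, for the interpolation-based adaptive learning rate: when a model can interpolate the training data, each per-sample loss can be driven to (approximately) zero, so a Polyak-style step size becomes computable from quantities available at every SGD step. A hedged sketch of such a rule, with \delta a small positive constant and \eta a maximal step size (both symbols chosen here for illustration), is

    \gamma_t = \min\left( \frac{\ell_{i_t}(w_t)}{\|\nabla \ell_{i_t}(w_t)\|^2 + \delta},\ \eta \right),
    \qquad w_{t+1} = w_t - \gamma_t \, \nabla \ell_{i_t}(w_t),

where \ell_{i_t} is the loss of the sample or mini-batch drawn at step t. This is only meant to show how interpolation turns the classical Polyak step size into a practical adaptive learning rate; the algorithm developed in the thesis may differ in its details.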
first_indexed 2024-03-07T00:10:57Z
format Thesis
id oxford-uuid:79360a95-a6e0-4acc-ba3a-07598f52ea39
institution University of Oxford
language English
last_indexed 2024-03-07T00:10:57Z
publishDate 2019
record_format dspace
spelling oxford-uuid:79360a95-a6e0-4acc-ba3a-07598f52ea39 2022-03-26T20:35:58Z Leveraging structure for optimization in deep learning Thesis http://purl.org/coar/resource_type/c_db06 uuid:79360a95-a6e0-4acc-ba3a-07598f52ea39 optimization deep learning machine learning English Hyrax Deposit 2019 Berrada, L Mudigonda, P Zisserman, A [abstract identical to the description field above]
spellingShingle optimization
deep learning
machine learning
Berrada, L
Leveraging structure for optimization in deep learning
title Leveraging structure for optimization in deep learning
title_full Leveraging structure for optimization in deep learning
title_fullStr Leveraging structure for optimization in deep learning
title_full_unstemmed Leveraging structure for optimization in deep learning
title_short Leveraging structure for optimization in deep learning
title_sort leveraging structure for optimization in deep learning
topic optimization
deep learning
machine learning
work_keys_str_mv AT berradal leveragingstructureforoptimizationindeeplearning