Leveraging structure for optimization in deep learning
Main Author: | Berrada, L |
---|---|
Other Authors: | Mudigonda, P; Zisserman, A |
Format: | Thesis |
Language: | English |
Published: | 2019 |
Subjects: | optimization; deep learning; machine learning |
author | Berrada, L |
author2 | Mudigonda, P |
description | <p>In the past decade, neural networks have demonstrated impressive performance in supervised learning. They now power applications ranging from real-time medical diagnosis and wild-animal monitoring to human-sounding virtual assistants. Despite their increasing importance, however, they remain difficult to train, owing to a complex interplay between the learning objective, the optimization algorithm, and generalization performance. Indeed, using different loss functions and optimization algorithms leads to trained models with significantly different performance on unseen data.</p>
<p>In this thesis, we focus first on the loss function, for which a task-specific approach can improve generalization performance in the small-data or noisy-data setting. Specifically, we consider the top-k classification setting. We show that traditional piecewise-linear top-k loss functions require smoothing to work well with neural networks. However, the resulting smoothed loss function and its gradient are computationally challenging to evaluate: a naive approach has a runtime proportional to the number of possible sets of k predictions. Thanks to a connection to polynomial algebra, we develop computationally efficient algorithms to evaluate the smoothed loss function and its gradient. This allows us to train models with stochastic gradient descent (SGD) using the smoothed top-k loss function, and we show that doing so is more robust to overfitting than using the standard cross-entropy loss function.</p>
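The combinatorial cost of the naive approach can be made concrete with a small sketch. The snippet below implements one plausible smoothed top-k loss — a log-sum-exp relaxation of a margin-based top-k error — by enumerating every k-subset of the labels. The function names, the unit margin, and the exact formulation are illustrative assumptions rather than the thesis's definition; the point is the O(n choose k) enumeration, which the thesis's polynomial-algebra algorithms are designed to avoid.

```python
import itertools
import math

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) over a list of floats.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def smooth_topk_loss(scores, y, k, tau=1.0):
    """Naive smoothed top-k loss: enumerate all k-subsets of the n labels.

    Runtime is O(C(n, k)) -- exactly the combinatorial cost that an
    efficient (polynomial-algebra) evaluation avoids.
    """
    n = len(scores)
    all_terms = []      # over every k-subset A of labels
    correct_terms = []  # over k-subsets that contain the true label y
    for A in itertools.combinations(range(n), k):
        margin = 0.0 if y in A else 1.0  # unit penalty if y is missed
        avg = sum(scores[j] for j in A) / k
        all_terms.append((margin + avg) / tau)
        if y in A:
            correct_terms.append(avg / tau)
    return tau * (logsumexp(all_terms) - logsumexp(correct_terms))

# Toy usage: the loss is smaller when the true label scores highly.
loss_good = smooth_topk_loss([0.0, 4.0, 0.0, 0.0], y=1, k=2)
loss_bad = smooth_topk_loss([4.0, 0.0, 0.0, 0.0], y=1, k=2)
```

Smoothing enters through the temperature `tau`: as `tau` tends to zero, the log-sum-exp terms approach hard maxima and the piecewise-linear top-k loss is recovered.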
<p>Second, we turn our attention to optimization algorithms. While SGD empirically provides good generalization, it requires a manually tuned learning-rate schedule, and obtaining a suitable schedule for a given network and data set is time-consuming and computationally expensive. In this thesis, we propose novel optimization algorithms that alleviate this issue by exploiting structure in three different ways, each leading to a new optimization algorithm. First, we exploit the piecewise linearity of the activation and loss functions, which results in a difference-of-convex programming approach. Second, we use the compositionality of the model and the loss function with the help of a proximal approach. Third, we exploit the property of interpolating models to derive an adaptive learning rate for SGD. Empirically, we compare the performance of the three algorithms on various deep learning tasks, and we demonstrate their advantages over state-of-the-art methods while avoiding the need for manual learning-rate schedules.</p> |
id | oxford-uuid:79360a95-a6e0-4acc-ba3a-07598f52ea39 |
institution | University of Oxford |
spelling | Berrada, L; Mudigonda, P; Zisserman, A. Leveraging structure for optimization in deep learning. Thesis (Hyrax Deposit), University of Oxford, 2019. oxford-uuid:79360a95-a6e0-4acc-ba3a-07598f52ea39 |
title | Leveraging structure for optimization in deep learning |
topic | optimization; deep learning; machine learning |