Towards robust reinforcement learning

Bibliographic Details
Main Author: Paul, S
Other Authors: Whiteson, S
Format: Thesis
Language: English
Published: 2020
Subjects:
Description
Summary: While reinforcement learning (RL) algorithms have been successfully applied to a wide range of problems -- from improving the energy efficiency of data centres to trading financial products and operating self-driving cars -- they have difficulty coping with environments characterised by significant rare events: events that have a low probability of occurrence but can have a large impact on the optimal policy. Furthermore, the performance of existing algorithms has been shown to be remarkably sensitive to their hyperparameters. As a result, when faced with a new problem, the common practice is to perform a search over these hyperparameters. While this is not a major issue in settings where generating fresh interactions with the environment is cheap (for example, learning policies for playing Atari games or simulated robotics tasks), it can preclude existing RL algorithms from being applied to problems where interactions with the environment are significantly more expensive.

In this thesis, we focus on the robustness of RL algorithms with regard to rare events and to their own hyperparameters. We present two algorithms -- Alternating Optimisation and Quadrature (ALOQ) and Fingerprint Policy Optimisation (FPO) -- that address the problem of robustness to rare events. Both algorithms are based on the principle of actively generating experience that takes into account the effect of any rare events, rather than relying on random sampling as is common in existing methods. While ALOQ is a self-contained method that uses Bayesian optimisation and Bayesian quadrature to perform policy search, FPO is designed to make existing policy gradient methods more robust in these settings. We also present Hyperparameter Optimisation on the Fly, a gradient-free algorithm designed to automatically learn the hyperparameters of policy gradient methods while requiring no more interactions than would normally be collected within one training run of the underlying policy gradient method. We empirically validate our algorithms through experiments across multiple domains.
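
The following is a minimal illustrative sketch, not taken from the thesis, of why the rare-event setting described above is hard for methods that rely on random sampling. It uses a hypothetical environment variable that triggers a catastrophic return with small probability: a naive Monte Carlo estimate of the expected return depends on whether the rare case happens to appear in the sample, whereas an estimate that explicitly weights each case by its probability (loosely in the spirit of the active, quadrature-based evaluation used by ALOQ) reflects the true objective. All names and numbers are assumptions chosen purely for illustration.

    import numpy as np

    # Illustrative sketch (not from the thesis): a rare event with
    # probability 0.001 produces a catastrophic return that dominates
    # the policy's true expected return.
    rng = np.random.default_rng(0)

    P_RARE = 1e-3            # probability of the rare event (assumed)
    RETURN_NOMINAL = 1.0     # return when the rare event does not occur
    RETURN_RARE = -2000.0    # catastrophic return when it does

    def naive_mc_estimate(n_samples):
        """Naive Monte Carlo: sample the environment variable at random."""
        rare = rng.random(n_samples) < P_RARE
        return np.where(rare, RETURN_RARE, RETURN_NOMINAL).mean()

    # Weighting each case by its known probability gives the true expected
    # return; actively accounting for rare events aims at this quantity
    # rather than at whatever the random samples happened to contain.
    true_expected_return = (1 - P_RARE) * RETURN_NOMINAL + P_RARE * RETURN_RARE

    print(f"naive MC estimate (1000 samples): {naive_mc_estimate(1000):+.3f}")
    print(f"true expected return:             {true_expected_return:+.3f}")

With 1000 samples the rare event appears roughly once on average, so repeated runs of the naive estimate swing between about +1.0 and -1.0, while the probability-weighted value stays fixed at -1.001; this variance is what motivates actively generating experience around the rare event rather than sampling it blindly.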