Bayesian and variational inference for reinforcement learning

Bibliographic Details
Main Author: Fellows, M
Other Authors: Whiteson, S
Format: Thesis
Language: English
Published: 2021
Subjects: Reinforcement learning

Description

This thesis explores Bayesian and variational inference in the context of solving the reinforcement learning (RL) problem. Recent advances in developing state-of-the-art algorithms suitable for continuous control introduce regularisation into the reinforcement learning objective. Analysis has revealed that introducing regularisation defines a metaphorical probabilistic inference problem whose solution is used to learn policies with improved exploration properties. Solving this inference problem enables the application of powerful optimisation tools such as variational inference to RL. However, existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, e.g., the lack of mode-capturing behaviour and difficulties in learning deterministic policies. Our first contribution is a theoretically grounded variational inference for reinforcement learning (VIREL) framework that utilises a parametrised Q-function to summarise the future dynamics of the underlying Markov decision process (MDP), generalising existing approaches. Our framework resolves theoretical issues of existing approaches, and an empirical evaluation demonstrates that actor-critic algorithms derived from VIREL outperform state-of-the-art methods based on soft value functions in several domains, even with approximations.
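
For context, the regularised objectives referred to above are typically of the maximum-entropy form. The equation below is a standard sketch of that objective and of its inference reading, written in our own notation (a temperature \(\alpha\) and policy entropy \(\mathcal{H}\)); it is not the exact VIREL objective.

\[
J_{\text{MaxEnt}}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(r(s_t,a_t) \;+\; \alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\Big)\right].
\]

Treating (soft) optimality as a fictitious observed variable turns maximisation of this objective into maximisation of an evidence lower bound, with the policy playing the role of the variational distribution; this is the sense in which regularisation defines a metaphorical inference problem.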

Our analysis reveals that the issues associated with regularised RL stem from the construction of the metaphorical inference problem, which does not capture the uncertainty in the RL problem as a true Bayesian approach would. Moreover, to define and solve this inference problem we must make incorrect assumptions about the MDP. We show that VIREL resolves these issues by mimicking the adaptive behaviour of a policy that arises from taking a true Bayesian approach to RL. However, the type of exploration used by VIREL policies is still suboptimal in comparison to true Bayesian policies, as it does not capture uncertainty in the MDP. This motivates taking a Bayesian approach to RL.

The de facto objective used in RL is frequentist. In this thesis, we argue for starting from the Bayesian objective instead. A key theme of this thesis is that, unlike frequentist approaches, a Bayesian approach to RL adheres to the likelihood principle: in defining the RL objective, frequentist approaches assume exact prior knowledge of the environment, which is unrealistic for most applications. In contrast, Bayesian methods characterise uncertainty in the MDP and only condition on knowledge that the agent has, even under approximation. Our analysis reveals that many of the pathologies of frequentist RL stem from breaking the likelihood principle and are immediately resolved by starting with a Bayesian RL objective instead.
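
As an illustrative contrast (again in our own notation, not necessarily the thesis's), the frequentist objective treats the true MDP as known exactly, whereas a Bayesian objective conditions only on the data \(\mathcal{D}\) the agent has observed and marginalises over a posterior over MDPs:

\[
J_{\text{freq}}(\pi) \;=\; \mathbb{E}^{\pi}_{P}\!\left[\sum_{t} \gamma^{t} r_t\right],
\qquad
J_{\text{Bayes}}(\pi) \;=\; \mathbb{E}_{\phi\sim p(\phi\mid\mathcal{D})}\,\mathbb{E}^{\pi}_{P_\phi}\!\left[\sum_{t} \gamma^{t} r_t\right],
\]

where \(P\) is the (assumed known) transition distribution and \(P_\phi\) an MDP indexed by parameters \(\phi\). Only the second objective respects the likelihood principle, since it conditions solely on what has actually been observed.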

Finally, we introduce a novel perspective on Bayesian RL: whereas existing approaches infer a posterior over the transition distribution or Q-function, we characterise the uncertainty in the Bellman operator. Our Bayesian Bellman operator (BBO) framework is motivated by the insight that when bootstrapping is introduced, model-free approaches actually infer a posterior over Bellman operators, not value functions. In this thesis, we use BBO to provide a rigorous theoretical analysis of model-free Bayesian RL to better understand its relationship to established frequentist RL methodologies. We prove that Bayesian solutions are consistent with frequentist RL solutions, even when approximate inference is used, and derive conditions under which convergence properties hold. Empirically, we demonstrate that algorithms derived from the BBO framework have sophisticated deep exploration properties that enable them to solve continuous control tasks at which state-of-the-art regularised actor-critic algorithms fail catastrophically.
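
To make the bootstrapping point concrete, here is a schematic in our own notation rather than the thesis's formal definitions: with bootstrapped targets, the quantities being regressed are samples of an empirical Bellman operator applied to the current Q-function, so Bayesian inference over those targets yields a posterior over the Bellman operator's image rather than over the value function itself,

\[
\hat{\mathcal{B}}Q_\omega(s,a) \;=\; r \;+\; \gamma \max_{a'} Q_\omega(s',a'),
\qquad
p\big(\mathcal{B}Q_\omega \mid \mathcal{D}\big) \;\propto\; p\big(\mathcal{D} \mid \mathcal{B}Q_\omega\big)\,p\big(\mathcal{B}Q_\omega\big),
\]

where \(\mathcal{D}\) is a dataset of bootstrapped targets. Sampling from such a posterior is one route to the deep exploration behaviour described above (in the spirit of posterior sampling); the precise construction used by BBO is given in the thesis itself.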