Bayesian and variational inference for reinforcement learning
Main Author: | Fellows, M |
---|---|
Other Authors: | Whiteson, S |
Format: | Thesis |
Language: | English |
Published: | 2021 |
Subjects: | Reinforcement learning |
_version_ | 1797108138612097024 |
---|---|
author | Fellows, M |
author2 | Whiteson, S |
author_facet | Whiteson, S Fellows, M |
author_sort | Fellows, M |
collection | OXFORD |
description | <p>This thesis explores Bayesian and variational inference in the context of solving the reinforcement learning (RL) problem. Recent advances in developing state-of-the-art algorithms suitable for continuous control introduce regularisation into the reinforcement learning objective. Analysis has revealed that introducing regularisation defines a metaphorical probabilistic inference problem whose solution is used to learn policies with improved exploration properties. Solving this inference problem enables the application of powerful optimisation tools, such as variational inference, to RL. However, existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, e.g., a lack of mode-capturing behaviour and difficulties in learning deterministic policies. Our first contribution is a theoretically grounded variational inference for reinforcement learning (VIREL) framework that utilises a parametrised Q-function to summarise the future dynamics of the underlying Markov decision process (MDP), generalising existing approaches. Our framework resolves theoretical issues of existing approaches, and an empirical evaluation demonstrates that actor-critic algorithms derived from VIREL outperform state-of-the-art methods based on soft value functions in several domains, even with approximations.</p>
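For readers unfamiliar with this line of work, the following is a minimal sketch, in our own illustrative notation rather than the thesis's exact definitions, of (i) an entropy-regularised RL objective of the kind used by soft value-function methods and (ii) an ELBO-style variational objective in the spirit of VIREL, with a parametrised critic $Q_\omega$ and variational policy $\pi_\theta$:

```latex
% Hedged sketch; alpha and epsilon are temperature-like coefficients,
% H denotes entropy, and the notation is illustrative only.
\[
  J_{\mathrm{reg}}(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}
      \Big( r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big)\right]
\]
% ELBO-style variational objective with critic Q_omega and policy pi_theta:
\[
  \mathcal{L}(\theta, \omega) = \mathbb{E}_{s \sim d}\!\left[
      \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ Q_\omega(s, a) \big]
      + \varepsilon\, \mathcal{H}\big(\pi_\theta(\cdot \mid s)\big) \right]
\]
```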
<p>Our analysis reveals that the issues associated with regularised RL stem from the construction of the metaphorical inference problem, which does not capture the uncertainty in the RL problem as a true Bayesian approach would. Moreover, to define and solve this inference problem we must make incorrect assumptions about the MDP. We show that VIREL resolves these issues by mimicking the adaptive behaviour of a policy that arises from taking a true Bayesian approach to RL. However, the type of exploration used by VIREL policies is still suboptimal in comparison to true Bayesian policies, as it does not capture uncertainty in the MDP. This motivates taking a Bayesian approach to RL.</p>
<p>The de facto objective used in RL is frequentist. In this thesis, we argue for starting from the Bayesian objective instead. A key theme of this thesis is that, unlike frequentist approaches, a Bayesian approach to RL adheres to the likelihood principle: in defining the RL objective, frequentist approaches assume exact prior knowledge of the environment, which is unrealistic for most applications. In contrast, Bayesian methods characterise uncertainty in the MDP and condition only on knowledge that the agent has, even under approximation. Our analysis reveals that many of the pathologies of frequentist RL stem from breaking the likelihood principle and are immediately resolved by starting with a Bayesian RL objective instead.</p>
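To illustrate the contrast drawn above in rough, non-authoritative notation: the frequentist objective evaluates the policy under the true MDP $M^{*}$, which the agent is implicitly assumed to know, whereas the Bayesian objective conditions only on the agent's data $\mathcal{D}$ via a posterior over MDPs.

```latex
% Illustrative contrast only; M* is the (unknown) true MDP and
% P(M | D) a posterior over MDPs given the agent's data D.
\[
  J_{\mathrm{freq}}(\pi) = \mathbb{E}_{\tau \sim (M^{*},\, \pi)}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right],
  \qquad
  J_{\mathrm{Bayes}}(\pi) = \mathbb{E}_{M \sim P(M \mid \mathcal{D})}\,
      \mathbb{E}_{\tau \sim (M,\, \pi)}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]
\]
```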
<p>Finally, we introduce a novel perspective on Bayesian RL: whereas existing approaches infer a posterior over the transition distribution or the Q-function, we characterise the uncertainty in the Bellman operator. Our Bayesian Bellman operator (BBO) framework is motivated by the insight that when bootstrapping is introduced, model-free approaches actually infer a posterior over Bellman operators, not value functions. In this thesis, we use BBO to provide a rigorous theoretical analysis of model-free Bayesian RL and to better understand its relationship to established frequentist RL methodologies. We prove that Bayesian solutions are consistent with frequentist RL solutions, even when approximate inference is used, and derive conditions under which convergence properties hold. Empirically, we demonstrate that algorithms derived from the BBO framework have sophisticated deep exploration properties that enable them to solve continuous control tasks at which state-of-the-art regularised actor-critic algorithms fail catastrophically.</p> |
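As a hedged sketch of the idea behind inferring a posterior over Bellman operators (notation is ours, not the exact BBO construction): bootstrapped targets built from the current critic $Q_\omega$ define a regression problem, and the posterior is placed over the Bellman operator's image $\mathcal{B}[Q_\omega]$ rather than over the value function directly.

```latex
% Hedged sketch of model-free inference over Bellman operators;
% D is a dataset of transitions (s_i, a_i, r_i, s'_i).
\[
  y_i = r_i + \gamma \max_{a'} Q_\omega(s'_i, a'), \qquad
  p\big(\mathcal{B}[Q_\omega] \mid \mathcal{D}\big) \;\propto\;
      p\big(\mathcal{B}[Q_\omega]\big)\, \prod_{i} p\big(y_i \mid \mathcal{B}[Q_\omega](s_i, a_i)\big)
\]
```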
first_indexed | 2024-03-07T07:23:36Z |
format | Thesis |
id | oxford-uuid:ebc17e10-b727-467d-8d71-3a9db4973665 |
institution | University of Oxford |
language | English |
last_indexed | 2024-03-07T07:23:36Z |
publishDate | 2021 |
record_format | dspace |
spelling | oxford-uuid:ebc17e10-b727-467d-8d71-3a9db49736652022-11-07T14:07:55ZBayesian and variational inference for reinforcement learningThesishttp://purl.org/coar/resource_type/c_db06uuid:ebc17e10-b727-467d-8d71-3a9db4973665Reinforcement learningEnglishHyrax Deposit2021Fellows, MWhiteson, SHartikainen, KMahajan, A |
spellingShingle | Reinforcement learning Fellows, M Bayesian and variational inference for reinforcement learning |
title | Bayesian and variational inference for reinforcement learning |
title_full | Bayesian and variational inference for reinforcement learning |
title_fullStr | Bayesian and variational inference for reinforcement learning |
title_full_unstemmed | Bayesian and variational inference for reinforcement learning |
title_short | Bayesian and variational inference for reinforcement learning |
title_sort | bayesian and variational inference for reinforcement learning |
topic | Reinforcement learning |
work_keys_str_mv | AT fellowsm bayesianandvariationalinferenceforreinforcementlearning |