Breaking the deadly triad in reinforcement learning

Bibliographic Details
Main Author: Zhang, S
Other Authors: Whiteson, S
Format: Thesis
Language: English
Published: 2022
Institution: University of Oxford
Subjects: Artificial intelligence
Full description

Reinforcement Learning (RL) is a promising framework for solving sequential decision-making problems that emerge from agent-environment interactions via trial and error. Off-policy learning is one of the most important techniques in RL: it enables an RL agent to learn from agent-environment interactions generated by a policy (i.e., a decision-making rule that an agent relies on to interact with the environment) that is different from the policy of interest. Arguably, this flexibility is key to applying RL to real-world problems. Off-policy learning, however, often destabilizes RL algorithms when combined with function approximation (i.e., using a parameterized function to represent quantities of interest) and bootstrapping (i.e., recursively constructing a learning target for an estimator by using the estimator itself), two arguably indispensable ingredients for large-scale RL applications. This instability, resulting from the combination of off-policy learning, function approximation, and bootstrapping, is the notorious deadly triad in RL.

In this thesis, we propose several novel RL algorithms that theoretically address the deadly triad. The proposed algorithms cover a wide range of RL settings (e.g., both prediction and control, both value-based and policy-based methods, both discounted and average-reward performance metrics). By contrast, existing methods address this issue in only a few RL settings, and even in those settings our methods exhibit several advantages, e.g., reduced variance and improved asymptotic performance guarantees. These improvements are made possible by several advanced tools (e.g., target networks, differential value functions, density ratios, and truncated followon traces). Importantly, the proposed algorithms remain fully incremental and computationally efficient, making them readily applicable to large-scale RL problems.

Besides the theoretical contributions in breaking the deadly triad, we also make empirical contributions by introducing a bi-directional target network that scales up residual algorithms, a family of RL algorithms that break the deadly triad in some restricted settings.
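The instability named by the deadly triad can be made concrete with a standard textbook fragment from the RL literature (not an algorithm from this thesis): two states share a single weight under linear function approximation, and off-policy semi-gradient TD(0) updates on the transition between them drive that weight to infinity whenever the discount factor exceeds 0.5. The Python sketch below uses illustrative toy values for the discount factor and step size and abstracts the behavior policy to repeatedly generating the same transition.

```python
# Minimal sketch of the deadly triad (illustrative, not from the thesis).
# Linear function approximation: v(s1) = 1 * w, v(s2) = 2 * w share one weight w.
# Under the target policy, s1 always transitions to s2 with reward 0.
gamma = 0.9   # discount factor (> 0.5, which is what makes this fragment diverge)
alpha = 0.1   # step size
w = 1.0       # single shared weight (function approximation)

for step in range(50):
    # Off-policy data: the behavior policy keeps producing the s1 -> s2
    # transition (importance sampling ratio taken as 1 for simplicity).
    phi_s, phi_next, reward = 1.0, 2.0, 0.0

    # Bootstrapping: the TD target reuses the current estimate of v(s2).
    td_error = reward + gamma * (phi_next * w) - phi_s * w

    # Semi-gradient TD(0) update on the shared weight.
    w += alpha * td_error * phi_s

    if step % 10 == 0:
        print(f"step {step:3d}  w = {w:.3f}")

# Each update multiplies w by (1 + alpha * (2 * gamma - 1)), so w diverges
# geometrically, even though each ingredient is harmless in isolation.
```

Removing any one ingredient restores stability in this fragment: on-policy data (visiting s2 and updating there) pulls the values back toward zero, a tabular representation gives each state its own weight, and a Monte Carlo target avoids bootstrapping altogether. The thesis's algorithms are aimed at keeping all three ingredients while provably avoiding this kind of divergence.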