Breaking the deadly triad in reinforcement learning
Main Author: | Zhang, S |
---|---|
Other Authors: | Whiteson, S |
Format: | Thesis |
Language: | English |
Published: | 2022 |
Subjects: | Artificial intelligence |
author | Zhang, S |
author2 | Whiteson, S |
collection | OXFORD |
description | <p>Reinforcement Learning (RL) is a promising framework for solving sequential decision-making problems that emerge from agent-environment interactions via trial and error. Off-policy learning, one of the most important techniques in RL, enables an RL agent to learn from agent-environment interactions generated by a policy (i.e., a decision-making rule that an agent relies on to interact with the environment) that is different from the policy of interest. Arguably, this flexibility is key to applying RL to real-world problems. Off-policy learning, however, often destabilizes RL algorithms when combined with function approximation (i.e., using a parameterized function to represent quantities of interest) and bootstrapping (i.e., recursively constructing a learning target for an estimator by using the estimator itself), two arguably indispensable ingredients for large-scale RL applications. This instability, resulting from the combination of off-policy learning, function approximation, and bootstrapping, is the notorious deadly triad in RL.</p>
<p>In this thesis, we propose several novel RL algorithms that theoretically address the deadly triad. The proposed algorithms cover a wide range of RL settings (e.g., both prediction and control, both value-based and policy-based methods, both discounted and average-reward performance metrics). By contrast, existing methods address this issue in only a few RL settings; even in those settings, our methods exhibit several advantages over existing ones, e.g., reduced variance and improved asymptotic performance guarantees. These improvements are made possible by several advanced tools (e.g., target networks, differential value functions, density ratios, and truncated followon traces). Importantly, the proposed algorithms remain fully incremental and computationally efficient, making them readily applicable to large-scale RL applications.</p>
<p>Besides these theoretical contributions to breaking the deadly triad, we also make an empirical contribution by introducing a bi-directional target network that scales up residual algorithms, a family of RL algorithms that break the deadly triad in some restricted settings.</p> |
first_indexed | 2024-03-07T07:14:55Z |
format | Thesis |
id | oxford-uuid:2c410803-2141-41ed-b362-7f14723b2f17 |
institution | University of Oxford |
language | English |
last_indexed | 2024-12-09T03:38:40Z |
publishDate | 2022 |
record_format | dspace |
title | Breaking the deadly triad in reinforcement learning |
topic | Artificial intelligence |
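To make the deadly triad described in the abstract concrete, the following minimal sketch (illustrative only, not taken from the thesis) reproduces the classic two-state "w, 2w" counterexample: linear function approximation supplies the features, bootstrapping supplies the learning target, and off-policy sampling only ever presents one transition, so semi-gradient TD(0) diverges even though all rewards are zero and the true values are zero. The discount factor, step size, and features below are assumed values chosen for illustration.

```python
# Minimal sketch (illustrative assumptions, not from the thesis): the classic
# two-state "w, 2w" example in which off-policy semi-gradient TD(0) with
# linear function approximation and bootstrapping diverges.
import numpy as np

gamma, alpha = 0.99, 0.1        # assumed discount factor and step size
phi = np.array([1.0, 2.0])      # linear features: v_hat(s) = w * phi[s]
w = 1.0                         # single weight, arbitrary initialisation

# Off-policy sampling: the behaviour policy only ever presents the transition
# s = 0 -> s' = 1 with reward 0, while the bootstrap target
# r + gamma * v_hat(s') is built from the current estimate itself.
for _ in range(200):
    s, s_next, r = 0, 1, 0.0
    td_error = r + gamma * w * phi[s_next] - w * phi[s]
    w += alpha * td_error * phi[s]          # semi-gradient TD(0) update

# Each update multiplies w by 1 + alpha * (2 * gamma - 1) > 1, so w explodes
# even though the true value of every state is 0.
print(f"w after 200 updates: {w:.3e}")
```

Removing any one ingredient, e.g., sampling on-policy so that the second state is also corrected, fixes this example, which is exactly why the instability is attributed to the triad as a whole rather than to any single component.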
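The last paragraph of the abstract mentions residual algorithms, the family (due to Baird) that replaces the semi-gradient step with the true gradient of the squared Bellman residual. As a hedged sketch under the same illustrative two-state setup as above (where the transition is deterministic, so the usual double-sampling issue does not arise), the residual-gradient update converges where the semi-gradient one diverged; the thesis's bi-directional target network is a separate, more scalable mechanism that is not shown here.

```python
# Sketch (same illustrative assumptions as above, not from the thesis):
# Baird-style residual-gradient update on the two-state example. Following
# the true gradient of the squared Bellman residual keeps it stable here
# even with off-policy sampling and linear function approximation.
gamma, alpha = 0.99, 0.1
phi = [1.0, 2.0]                # v_hat(s) = w * phi[s]
w = 1.0

for _ in range(200):
    s, s_next, r = 0, 1, 0.0
    delta = r + gamma * w * phi[s_next] - w * phi[s]     # Bellman residual
    # d/dw of 0.5 * delta**2 is delta * (gamma * phi[s_next] - phi[s]);
    # gradient descent therefore moves w opposite to that direction.
    w -= alpha * delta * (gamma * phi[s_next] - phi[s])

print(f"w after 200 updates: {w:.3e}")   # shrinks towards the true value 0
```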