Breaking the deadly triad in reinforcement learning
Main Author: | Zhang, S |
---|---|
Other Authors: | Whiteson, S |
Format: | Thesis |
Language: | English |
Published: | 2022 |
Subjects: | Artificial intelligence |
author | Zhang, S |
author2 | Whiteson, S |
collection | OXFORD |
description | <p>Reinforcement Learning (RL) is a promising framework for solving sequential decision-making problems that emerge from agent-environment interactions via trial and error. Off-policy learning, one of the most important techniques in RL, enables an RL agent to learn from agent-environment interactions generated by a policy (i.e., a decision-making rule that an agent relies on to interact with the environment) that is different from the policy of interest. Arguably, this flexibility is key to applying RL to real-world problems. Off-policy learning, however, often destabilizes RL algorithms when combined with function approximation (i.e., using a parameterized function to represent quantities of interest) and bootstrapping (i.e., recursively constructing a learning target for an estimator by using the estimator itself), two arguably indispensable ingredients for large-scale RL applications. This instability, resulting from the combination of off-policy learning, function approximation, and bootstrapping, is the notorious deadly triad in RL.</p>
<p>In this thesis, we propose several novel RL algorithms that theoretically address the deadly triad. The proposed algorithms cover a wide range of RL settings (e.g., both prediction and control, both value-based and policy-based methods, both discounted and average-reward performance metrics). By contrast, existing methods address this issue in only a few RL settings; even in those settings, our methods exhibit several advantages over existing ones, e.g., reduced variance and improved asymptotic performance guarantees. These improvements are made possible by several advanced tools (e.g., target networks, differential value functions, density ratios, and truncated followon traces). Importantly, the proposed algorithms remain fully incremental and computationally efficient, making them readily applicable to large-scale RL applications.</p>
<p>Besides these theoretical contributions to breaking the deadly triad, we also make an empirical contribution by introducing a bi-directional target network that scales up residual algorithms, a family of RL algorithms that break the deadly triad in some restricted settings.</p> |
first_indexed | 2024-03-07T07:14:55Z |
format | Thesis |
id | oxford-uuid:2c410803-2141-41ed-b362-7f14723b2f17 |
institution | University of Oxford |
language | English |
last_indexed | 2024-12-09T03:38:40Z |
publishDate | 2022 |
record_format | dspace |
title | Breaking the deadly triad in reinforcement learning |
topic | Artificial intelligence |
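To make the deadly triad described in the abstract concrete, the following minimal sketch (illustrative only, not taken from the thesis) reproduces the classic two-state "w, 2w" counterexample: linear function approximation supplies the features, bootstrapping supplies the learning target, and off-policy sampling only ever presents one transition, so semi-gradient TD(0) diverges even though all rewards are zero and the true values are zero. The discount factor, step size, and features below are assumed values chosen for illustration.

```python
# Minimal sketch (illustrative assumptions, not from the thesis): the classic
# two-state "w, 2w" example in which off-policy semi-gradient TD(0) with
# linear function approximation and bootstrapping diverges.
import numpy as np

gamma, alpha = 0.99, 0.1        # assumed discount factor and step size
phi = np.array([1.0, 2.0])      # linear features: v_hat(s) = w * phi[s]
w = 1.0                         # single weight, arbitrary initialisation

# Off-policy sampling: the behaviour policy only ever presents the transition
# s = 0 -> s' = 1 with reward 0, while the bootstrap target
# r + gamma * v_hat(s') is built from the current estimate itself.
for _ in range(200):
    s, s_next, r = 0, 1, 0.0
    td_error = r + gamma * w * phi[s_next] - w * phi[s]
    w += alpha * td_error * phi[s]          # semi-gradient TD(0) update

# Each update multiplies w by 1 + alpha * (2 * gamma - 1) > 1, so w explodes
# even though the true value of every state is 0.
print(f"w after 200 updates: {w:.3e}")
```

Removing any one ingredient, e.g., sampling on-policy so that the second state is also corrected, fixes this example, which is exactly why the instability is attributed to the triad as a whole rather than to any single component.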
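The last paragraph of the abstract mentions residual algorithms, the family (due to Baird) that replaces the semi-gradient step with the true gradient of the squared Bellman residual. As a hedged sketch under the same illustrative two-state setup as above (where the transition is deterministic, so the usual double-sampling issue does not arise), the residual-gradient update converges where the semi-gradient one diverged; the thesis's bi-directional target network is a separate, more scalable mechanism that is not shown here.

```python
# Sketch (same illustrative assumptions as above, not from the thesis):
# Baird-style residual-gradient update on the two-state example. Following
# the true gradient of the squared Bellman residual keeps it stable here
# even with off-policy sampling and linear function approximation.
gamma, alpha = 0.99, 0.1
phi = [1.0, 2.0]                # v_hat(s) = w * phi[s]
w = 1.0

for _ in range(200):
    s, s_next, r = 0, 1, 0.0
    delta = r + gamma * w * phi[s_next] - w * phi[s]     # Bellman residual
    # d/dw of 0.5 * delta**2 is delta * (gamma * phi[s_next] - phi[s]);
    # gradient descent therefore moves w opposite to that direction.
    w -= alpha * delta * (gamma * phi[s_next] - phi[s])

print(f"w after 200 updates: {w:.3e}")   # shrinks towards the true value 0
```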