Expected policy gradients for reinforcement learning

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected Sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussian policies and quadratic critics and then extend it to a universal analytical method, covering a broad class of actors and critics, including Gaussian policies, exponential families, and policies with bounded support. For Gaussian policies, we introduce an exploration method that uses covariance proportional to e^H, the matrix exponential of the scaled Hessian H of the critic with respect to the actions. For discrete action spaces, we derive a variant of EPG based on softmax policies. We also establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. Furthermore, we prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and with little computational overhead. Finally, we provide an extensive experimental evaluation of EPG and show that it outperforms existing approaches on multiple challenging control domains.
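
As a point of reference for the claims above, the quantity all three methods target can be sketched in standard policy gradient notation (this is the textbook form, not a verbatim statement of the paper's general theorem; rho^pi denotes the discounted state distribution and Q-hat the critic):

    \nabla_\theta J(\theta)
        \;=\; \mathbb{E}_{s \sim \rho^{\pi}}
        \left[ \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\,
               \hat{Q}(s, a)\, \mathrm{d}a \right]

SPG rewrites the integrand as \pi_\theta(a \mid s)\,\nabla_\theta \log \pi_\theta(a \mid s)\,\hat{Q}(s, a) and estimates the integral with the single sampled action; DPG is the limiting case of a Dirac policy, where the integral collapses to \nabla_\theta \mu_\theta(s)\,\nabla_a \hat{Q}(s, a)\big|_{a = \mu_\theta(s)}. EPG instead evaluates the integral itself, analytically where possible, at every visited state.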

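For the discrete case, that expectation over actions can be computed exactly. Below is a minimal sketch of the idea for a linear-softmax policy, contrasting the single-sample estimator with the expected one (hypothetical function names; NumPy only; the paper's full algorithm also covers the continuous, analytical cases):

    import numpy as np

    def softmax(z):
        z = z - z.max()                      # stabilize the exponentials
        e = np.exp(z)
        return e / e.sum()

    def spg_grad(theta, s_feat, q_values, rng):
        # Classic single-sample (likelihood-ratio) estimate: uses only the
        # one action actually sampled from the softmax policy.
        pi = softmax(theta @ s_feat)         # theta: (n_actions, n_features)
        a = rng.choice(len(pi), p=pi)
        # grad of log pi(a|s) w.r.t. the logits is one_hot(a) - pi
        g_logits = (np.eye(len(pi))[a] - pi) * q_values[a]
        return np.outer(g_logits, s_feat)

    def epg_grad(theta, s_feat, q_values):
        # EPG-style estimate: sums over ALL actions, weighting the gradient
        # of pi(.|s) by the critic, so the action-sampling noise vanishes.
        pi = softmax(theta @ s_feat)
        # d/dlogits of sum_a pi_a * Q_a, with the critic held fixed
        g_logits = pi * (q_values - pi @ q_values)
        return np.outer(g_logits, s_feat)

Averaging spg_grad over many sampled actions recovers epg_grad in expectation; EPG simply computes that expectation directly, which is the variance reduction the abstract claims.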

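The e^H exploration rule for Gaussian policies also has a direct reading: action directions in which the critic curves upward get wide exploration, while directions in which it curves downward (near a local maximum) get narrow exploration. A hedged numerical sketch, with an illustrative scaling constant and scipy.linalg.expm as one way to get the matrix exponential:

    import numpy as np
    from scipy.linalg import expm

    def exploration_covariance(hessian, scale=1.0):
        # Covariance proportional to e^H, with H the scaled Hessian of the
        # critic with respect to the actions. e^H is symmetric positive
        # definite whenever H is symmetric, so it is always a valid covariance.
        H = scale * 0.5 * (hessian + hessian.T)   # symmetrize for safety
        return expm(H)

    # One action dimension with positive curvature (explore widely) and one
    # with negative curvature (the critic already peaks there; stay close):
    print(exploration_covariance(np.diag([1.0, -2.0])))
    # -> diag(e^1, e^-2), roughly diag(2.72, 0.14)
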
Bibliographic Details
Main Authors: Ciosek, K; Whiteson, S
Format: Journal article
Language: English
Published: Journal of Machine Learning Research, 2020
Institution: University of Oxford