GradientDICE: rethinking generalized offline estimation of stationary values
We present GradientDICE for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. GradientDICE fixes several problems of GenDICE (Zhang et al., 2020), the current state-of-the-art for estimating such density ratios. Namely, the optimization problem in GenDICE is not a convex-concave saddle-point problem once nonlinearity in optimization variable parameterization is introduced to ensure positivity, so primal-dual algorithms are not guaranteed to find the desired solution. However, such nonlinearity is essential to ensure the consistency of GenDICE even with a tabular representation. This is a fundamental contradiction, resulting from GenDICE's original formulation of the optimization problem. In GradientDICE, we optimize a different objective from GenDICE by using the Perron-Frobenius theorem and eliminating GenDICE's use of divergence, such that nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation.
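For readers unfamiliar with the density-ratio formulation mentioned in the abstract, the following is a minimal sketch of the standard setup (notation such as τ, d_π, d_μ, and ρ_π is illustrative and not taken from this record): τ is the ratio between the target policy's stationary distribution and the sampling distribution, constrained to be a proper reweighting, and it lets the target policy's expected reward be estimated purely from off-policy data.

```latex
% Illustrative notation, not quoted from the paper:
%   d_pi  -- stationary state(-action) distribution of the target policy pi
%   d_mu  -- distribution that generated the off-policy (sampling) data
\[
  \tau(s,a) \;=\; \frac{d_\pi(s,a)}{d_\mu(s,a)},
  \qquad
  \mathbb{E}_{(s,a)\sim d_\mu}\!\big[\tau(s,a)\big] \;=\; 1 .
\]
% Given tau, the target policy's expected reward can be estimated
% from samples drawn under d_mu alone:
\[
  \rho_\pi \;=\; \mathbb{E}_{(s,a)\sim d_\pi}\!\big[r(s,a)\big]
           \;=\; \mathbb{E}_{(s,a)\sim d_\mu}\!\big[\tau(s,a)\, r(s,a)\big].
\]
```

The positivity requirement discussed in the abstract arises because τ is a ratio of distributions and must be nonnegative; GenDICE enforces this through a nonlinear parameterization, which is what breaks the convex-concave structure of its saddle-point problem.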
Main Authors: | Zhang, S, Liu, B, Whiteson, S |
---|---|
Format: | Conference item |
Language: | English |
Published / Created: | Journal of Machine Learning Research, 2020 |
author | Zhang, S Liu, B Whiteson, S |
collection | OXFORD |
description | We present GradientDICE for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. GradientDICE fixes several problems of GenDICE (Zhang et al., 2020), the current state-of-the-art for estimating such density ratios. Namely, the optimization problem in GenDICE is not a convex-concave saddle-point problem once nonlinearity in optimization variable parameterization is introduced to ensure positivity, so primal-dual algorithms are not guaranteed to find the desired solution. However, such nonlinearity is essential to ensure the consistency of GenDICE even with a tabular representation. This is a fundamental contradiction, resulting from GenDICE’s original formulation of the optimization problem. In GradientDICE, we optimize a different objective from GenDICE by using the Perron-Frobenius theorem and eliminating GenDICE’s use of divergence, such that nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation. |
format | Conference item |
id | oxford-uuid:3c29812e-af50-407f-acf8-c9ee52d43fec |
institution | University of Oxford |
language | English |
publishDate | 2020 |
publisher | Journal of Machine Learning Research |
record_format | dspace |
title | GradientDICE: rethinking generalized offline estimation of stationary values |