Policy Improvement for POMDPs Using Normalized Importance Sampling

We present a new method for estimating the expected return of a POMDP from experience. The estimator does not assume any knowledge of the POMDP and allows the experience to be gathered with an arbitrary set of policies. The return is estimated for any new policy of the POMDP. We motivate the estimator from function-approximation and importance-sampling points of view and derive its theoretical properties. Although the estimator is biased, it has low variance, and the bias is often irrelevant when the estimator is used for pairwise comparisons. We conclude by extending the estimator to policies with memory and compare its performance in a greedy search algorithm to that of the REINFORCE algorithm, showing an order-of-magnitude reduction in the number of trials required.

Bibliographic Details
Main Author: Shelton, Christian R.
Institution: Massachusetts Institute of Technology
Report Numbers: AIM-2001-002, CBCL-194
Language: English
Published: 2004
Online Access: http://hdl.handle.net/1721.1/7218
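
For orientation only, the normalized (weighted) importance-sampling idea named in the title can be sketched as follows. This is a minimal illustration under an assumed data layout, not the paper's actual estimator (which, per the abstract, is also extended to policies with memory); the names estimated_return, target_policy, and behavior_prob are hypothetical.

def estimated_return(trajectories, target_policy):
    """Estimate the expected return of target_policy from off-policy trials.

    trajectories: list of (steps, ret) pairs, where steps is a list of
        (observation, action, behavior_prob) tuples logged while following
        whatever policy gathered the data, and ret is that trial's return.
    target_policy: callable (observation, action) -> probability of taking
        that action under the policy being evaluated.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for steps, ret in trajectories:
        # Importance weight: ratio of the trajectory's action probabilities
        # under the target policy to those under the behavior policy.
        w = 1.0
        for obs, action, behavior_prob in steps:
            w *= target_policy(obs, action) / behavior_prob
        weighted_sum += w * ret
        weight_total += w
    # Normalizing by the total weight rather than by the number of trials
    # gives the weighted estimator: biased, but typically lower variance.
    return weighted_sum / weight_total if weight_total > 0 else 0.0

Dividing by the sum of the weights instead of the number of trials is what introduces bias while reducing variance, in line with the trade-off described in the abstract.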