Policy Improvement for POMDPs Using Normalized Importance Sampling

We present a new method for estimating the expected return of a POMDP from experience. The estimator does not assume any knowledge of the POMDP and allows the experience to be gathered with an arbitrary set of policies. The return is estimated for any new policy of the POMDP. We motivate the estimator from function-approximation and importance-sampling points of view and derive its theoretical properties. Although the estimator is biased, it has low variance, and the bias is often irrelevant when the estimator is used for pairwise comparisons. We conclude by extending the estimator to policies with memory and compare its performance in a greedy search algorithm to that of the REINFORCE algorithm, showing an order-of-magnitude reduction in the number of trials required.

Bibliographic Details
Main Author: Shelton, Christian R.
Institution: Massachusetts Institute of Technology
Report Numbers: AIM-2001-002, CBCL-194
Language: English
Published: 2004
Online Access: http://hdl.handle.net/1721.1/7218
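
For orientation only, the normalized (weighted) importance-sampling idea named in the title can be sketched as follows. This is a minimal illustration under an assumed data layout, not the paper's actual estimator (which, per the abstract, is also extended to policies with memory); the names estimated_return, target_policy, and behavior_prob are hypothetical.

def estimated_return(trajectories, target_policy):
    """Estimate the expected return of target_policy from off-policy trials.

    trajectories: list of (steps, ret) pairs, where steps is a list of
        (observation, action, behavior_prob) tuples logged while following
        whatever policy gathered the data, and ret is that trial's return.
    target_policy: callable (observation, action) -> probability of taking
        that action under the policy being evaluated.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for steps, ret in trajectories:
        # Importance weight: ratio of the trajectory's action probabilities
        # under the target policy to those under the behavior policy.
        w = 1.0
        for obs, action, behavior_prob in steps:
            w *= target_policy(obs, action) / behavior_prob
        weighted_sum += w * ret
        weight_total += w
    # Normalizing by the total weight rather than by the number of trials
    # gives the weighted estimator: biased, but typically lower variance.
    return weighted_sum / weight_total if weight_total > 0 else 0.0

Dividing by the sum of the weights instead of the number of trials is what introduces bias while reducing variance, in line with the trade-off described in the abstract.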