Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal state costs or Q-factors. The main difference is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm requires solving an optimal stopping problem. The solution of this problem may be inexact, with a finite number of value iterations, in the spirit of modified policy iteration. The stopping problem structure is incorporated into the standard Q-learning algorithm to obtain a new method that is intermediate between policy iteration and Q-learning/value iteration. Thanks to its special contraction properties, our method overcomes some of the traditional convergence difficulties of modified policy iteration and admits asynchronous deterministic and stochastic iterative implementations, with lower overhead and/or more reliable convergence over existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm addresses effectively the inherent difficulties of approximate policy iteration due to inadequate exploration of the state and control spaces.
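
The evaluation phase described in the abstract can be pictured with a small tabular sketch. The snippet below is a rough, simplified rendering of the idea, not the paper's exact algorithm: for a finite MDP with known transition matrices it alternates a greedy policy update with a few value iterations on an optimal stopping problem whose stopping cost is the current Q iterate (at each state, either stop and collect min_v Q(j, v), or continue under the current policy). The function name, the expected one-stage cost model g[u], the synchronous sweeps, and the parameter defaults are illustrative assumptions.

```python
import numpy as np

def enhanced_policy_iteration(P, g, alpha, num_sweeps=50, m=5, tol=1e-8):
    """Sketch of a policy-iteration-like Q-factor method whose evaluation
    phase runs m value iterations of an optimal stopping problem
    (stopping cost = current Q iterate), in the spirit of the abstract.

    P[u]: |S| x |S| transition matrix under control u (assumed known).
    g[u]: length-|S| expected one-stage cost under control u (simplified cost model).
    """
    nU, nS = len(P), P[0].shape[0]
    Q = np.zeros((nS, nU))                      # current Q-factor estimates
    for _ in range(num_sweeps):
        mu = Q.argmin(axis=1)                   # greedy policy improvement
        J = Q.min(axis=1)                       # stopping cost taken from current Q
        Qe = Q.copy()
        for _ in range(m):                      # inexact evaluation: m stopping-problem VI steps
            cont = Qe[np.arange(nS), mu]        # continuation value under policy mu
            w = np.minimum(J, cont)             # stop (collect J) or continue under mu
            Qe = np.stack([g[u] + alpha * (P[u] @ w) for u in range(nU)], axis=1)
        if np.max(np.abs(Qe - Q)) < tol:
            return Qe, Qe.argmin(axis=1)
        Q = Qe
    return Q, Q.argmin(axis=1)
```

With m = 1 this update collapses to a synchronous Q-learning/value-iteration step, while letting m grow approaches full evaluation of the greedy policy, which is the sense in which the method sits between value iteration and policy iteration.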

Bibliographic Details
Main Authors: Bertsekas, Dimitri P.; Yu, Huizhen
Other Authors: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science; Massachusetts Institute of Technology. Laboratory for Information and Decision Systems
Format: Article
Language: English (en_US)
Published: Institute for Operations Research and the Management Sciences (INFORMS), 2019
Online Access: https://hdl.handle.net/1721.1/121248
Journal: Mathematics of Operations Research
Citation: Bertsekas, Dimitri P., and Huizhen Yu. "Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming." Mathematics of Operations Research, vol. 37, no. 1, Feb. 2012, pp. 66–94.
DOI: https://doi.org/10.1287/moor.1110.0532
ISSN: 0364-765X, 1526-5471
Funding: National Science Foundation (U.S.) (Grant ECCS-0801549); United States. Air Force (Grant FA9550-10-1-0412); Academy of Finland (Grant 118653)
License: Creative Commons Attribution-Noncommercial-Share Alike (http://creativecommons.org/licenses/by-nc-sa/4.0/)