Q-learning and policy iteration algorithms for stochastic shortest path problems

We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with convergence as reliable as that of classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.
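
The algorithmic idea summarized above can be illustrated with a small sketch. The following is a minimal synchronous lookup-table version of the general scheme, for illustration only: it is not the authors' asynchronous algorithm from the paper, and the model arrays P and g, the function name pi_like_q_learning, and the sweep counts are hypothetical placeholders.

# Minimal synchronous sketch of the idea described in the abstract:
# policy evaluation is done by a few value iterations on an "optimal
# stopping" operator (stop at the frozen cost J, or continue with the
# current policy's Q-factor), so most sweeps avoid a minimization over
# all controls. Illustrative reconstruction only, NOT the authors'
# exact asynchronous algorithm; P, g, and all parameters are placeholders.
import numpy as np

def pi_like_q_learning(P, g, n_states, n_controls,
                       n_policy_updates=50, m_eval_sweeps=5):
    # P[u][i, j]: transition probability from state i to j under control u
    # g[u][i]:    expected one-stage cost at state i under control u
    # Termination is modeled implicitly: a row of P[u] may sum to less than 1,
    # with the missing mass going to the zero-cost absorbing termination state.
    Q = np.zeros((n_states, n_controls))
    for _ in range(n_policy_updates):
        J = Q.min(axis=1)        # frozen "stopping" costs (one min over controls)
        mu = Q.argmin(axis=1)    # greedy policy w.r.t. the current Q-factors
        for _ in range(m_eval_sweeps):
            # Optimal-stopping value iteration: at the next state j, either
            # "stop" at cost J[j] or "continue" with Q(j, mu(j)).
            cont = Q[np.arange(n_states), mu]
            target = np.minimum(J, cont)
            for u in range(n_controls):
                Q[:, u] = g[u] + P[u] @ target
    return Q

In this sketch only the policy-update step takes a minimization over all controls; each of the m evaluation sweeps compares just two numbers per state, which is the lower-overhead property highlighted in the description above.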

Bibliographic Details
Main Authors: Yu, Huizhen; Bertsekas, Dimitri P.
Other Authors: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science; Massachusetts Institute of Technology. Laboratory for Information and Decision Systems
Format: Article
Language: en_US
Published: Springer-Verlag, 2015
Online Access: http://hdl.handle.net/1721.1/93745
https://orcid.org/0000-0001-6909-7208
Citation: Yu, Huizhen, and Dimitri P. Bertsekas. "Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems." Annals of Operations Research 208, no. 1 (April 18, 2012): 95–132.
DOI: http://dx.doi.org/10.1007/s10479-012-1128-z
ISSN: 0254-5330; 1572-9338
Funding: United States. Air Force (Grant FA9550-10-1-0412); National Science Foundation (U.S.) (Grant ECCS-0801549)
License: Creative Commons Attribution-Noncommercial-Share Alike (http://creativecommons.org/licenses/by-nc-sa/4.0/)