Pathologies of Temporal Difference Methods in Approximate Dynamic Programming

Full description
Approximate policy iteration methods based on temporal differences are popular in practice and have been tested extensively since the early nineties, but the associated convergence behavior is complex and not well understood at present. An important question is whether the policy iteration process is seriously hampered by oscillations between poor policies, roughly similar to the attraction of gradient methods to poor local minima. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it has been documented with several simple examples. Recent computational experimentation with the game of Tetris, a popular testbed for approximate DP algorithms over a 15-year period, has brought the issue into sharp focus. In particular, using a standard set of 22 features and temporal difference methods, an average score of a few thousand was achieved; using the same features and a random search method, an overwhelmingly better average score was achieved (600,000-900,000). The paper explains the likely mechanism of this phenomenon and derives conditions under which it will not occur.
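
To make the setting concrete, below is a minimal, illustrative Python sketch of approximate policy iteration with a linear, feature-based cost approximation on a small randomly generated MDP. The MDP data, feature matrix, discount factor, and the least-squares evaluation step (a stand-in for TD(lambda)/LSTD) are assumptions for illustration only; this is not the paper's Tetris setup or its exact algorithm.

import numpy as np

# Illustrative sketch only: approximate policy iteration with a linear
# cost approximation J_mu(i) ~ phi(i)^T r on an assumed small MDP.
n_states, n_actions, n_features = 5, 2, 2
rng = np.random.default_rng(0)

# Assumed data: transition probabilities P[a, i, j], stage costs g[a, i],
# feature matrix Phi, and discount factor gamma.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
g = rng.random((n_actions, n_states))
Phi = rng.random((n_states, n_features))
gamma = 0.9

def evaluate(policy):
    """Approximate policy evaluation: solve Phi r ~ g_mu + gamma P_mu Phi r
    in a least-squares sense (a stand-in for a TD-based evaluation)."""
    P_mu = P[policy, np.arange(n_states)]
    g_mu = g[policy, np.arange(n_states)]
    A = Phi - gamma * P_mu @ Phi
    r, *_ = np.linalg.lstsq(A, g_mu, rcond=None)
    return r

def improve(r):
    """Policy improvement: greedy (cost-minimizing) policy w.r.t. Phi r."""
    J = Phi @ r
    Q = g + gamma * P @ J          # Q[a, i]
    return Q.argmin(axis=0)

policy = np.zeros(n_states, dtype=int)
for k in range(20):
    r = evaluate(policy)
    new_policy = improve(r)
    if np.array_equal(new_policy, policy):
        break                      # repeated policy: the iteration has settled
    policy = new_policy            # otherwise it may cycle among policies
print("policy after", k + 1, "iterations:", policy)

Because the evaluation step only approximates each policy's cost in the span of the features, the greedy improvement step can map a policy to a worse one and back again; this cycling among poor policies is the oscillation phenomenon the abstract refers to.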

Bibliographic Details
Main Author: Bertsekas, Dimitri P.
Other Authors: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Format: Article
Language: en_US
Published: Institute of Electrical and Electronics Engineers, 2011
Online Access: http://hdl.handle.net/1721.1/64641
https://orcid.org/0000-0001-6909-7208
Citation: Bertsekas, Dimitri P. "Pathologies of Temporal Difference Methods in Approximate Dynamic Programming." In Proceedings of the 49th IEEE Conference on Decision and Control, Dec. 15-17, 2010, Hilton Atlanta Hotel, Atlanta, Georgia, USA.
Published in: Proceedings of the 49th IEEE Conference on Decision and Control (CDC), 2010
Type: Conference Paper
Date Issued: 2010-12
ISSN: 0743-1546, 0191-2216
Sponsors: National Science Foundation (U.S.) (NSF Grant ECCS-0801549); United States. Air Force (Grant FA9550-10-1-0412); Los Alamos National Laboratory. Information Science and Technology Institute
License: Creative Commons Attribution-Noncommercial-Share Alike 3.0 (http://creativecommons.org/licenses/by-nc-sa/3.0/)