Pathologies of Temporal Difference Methods in Approximate Dynamic Programming

Full description
Approximate policy iteration methods based on temporal differences are popular in practice and have been tested extensively since the early nineties, but the associated convergence behavior is complex and not well understood at present. An important question is whether the policy iteration process is seriously hampered by oscillations between poor policies, roughly similar to the attraction of gradient methods to poor local minima. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it has been documented with several simple examples. Recent computational experimentation with the game of Tetris, a popular testbed for approximate DP algorithms over a 15-year period, has brought the issue into sharp focus. In particular, using a standard set of 22 features and temporal difference methods, an average score of a few thousand was achieved; using the same features and a random search method, an overwhelmingly better average score was achieved (600,000-900,000). The paper explains the likely mechanism of this phenomenon and derives conditions under which it will not occur.
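
To make the setting concrete, below is a minimal, illustrative Python sketch of approximate policy iteration with a linear, feature-based cost approximation on a small randomly generated MDP. The MDP data, feature matrix, discount factor, and the least-squares evaluation step (a stand-in for TD(lambda)/LSTD) are assumptions for illustration only; this is not the paper's Tetris setup or its exact algorithm.

import numpy as np

# Illustrative sketch only: approximate policy iteration with a linear
# cost approximation J_mu(i) ~ phi(i)^T r on an assumed small MDP.
n_states, n_actions, n_features = 5, 2, 2
rng = np.random.default_rng(0)

# Assumed data: transition probabilities P[a, i, j], stage costs g[a, i],
# feature matrix Phi, and discount factor gamma.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
g = rng.random((n_actions, n_states))
Phi = rng.random((n_states, n_features))
gamma = 0.9

def evaluate(policy):
    """Approximate policy evaluation: solve Phi r ~ g_mu + gamma P_mu Phi r
    in a least-squares sense (a stand-in for a TD-based evaluation)."""
    P_mu = P[policy, np.arange(n_states)]
    g_mu = g[policy, np.arange(n_states)]
    A = Phi - gamma * P_mu @ Phi
    r, *_ = np.linalg.lstsq(A, g_mu, rcond=None)
    return r

def improve(r):
    """Policy improvement: greedy (cost-minimizing) policy w.r.t. Phi r."""
    J = Phi @ r
    Q = g + gamma * P @ J          # Q[a, i]
    return Q.argmin(axis=0)

policy = np.zeros(n_states, dtype=int)
for k in range(20):
    r = evaluate(policy)
    new_policy = improve(r)
    if np.array_equal(new_policy, policy):
        break                      # repeated policy: the iteration has settled
    policy = new_policy            # otherwise it may cycle among policies
print("policy after", k + 1, "iterations:", policy)

Because the evaluation step only approximates each policy's cost in the span of the features, the greedy improvement step can map a policy to a worse one and back again; this cycling among poor policies is the oscillation phenomenon the abstract refers to.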

Bibliographic Details
Main Author: Bertsekas, Dimitri P.
Other Authors: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Format: Article
Language: en_US
Published: Institute of Electrical and Electronics Engineers, 2011
Online Access: http://hdl.handle.net/1721.1/64641
https://orcid.org/0000-0001-6909-7208
Citation: Bertsekas, Dimitri P. "Pathologies of Temporal Difference Methods in Approximate Dynamic Programming." In Proceedings of the 49th IEEE Conference on Decision and Control, Dec. 15-17, 2010, Hilton Atlanta Hotel, Atlanta, Georgia, USA.
Published in: Proceedings of the 49th IEEE Conference on Decision and Control (CDC), 2010
Type: Conference Paper
Date Issued: 2010-12
ISSN: 0743-1546, 0191-2216
Sponsors: National Science Foundation (U.S.) (NSF Grant ECCS-0801549); United States. Air Force (Grant FA9550-10-1-0412); Los Alamos National Laboratory. Information Science and Technology Institute
License: Creative Commons Attribution-Noncommercial-Share Alike 3.0 (http://creativecommons.org/licenses/by-nc-sa/3.0/)