Pathologies of Temporal Difference Methods in Approximate Dynamic Programming
Approximate policy iteration methods based on temporal differences are popular in practice, and have been tested extensively, dating to the early nineties, but the associated convergence behavior is complex, and not well understood at present. An important question is whether the policy iterati...
Main Author: | |
---|---|
Other Authors: | |
Format: | Article |
Language: | en_US |
Published: |
Institute of Electrical and Electronics Engineers
2011
|
Online Access: | http://hdl.handle.net/1721.1/64641 https://orcid.org/0000-0001-6909-7208 |
_version_ | 1811080194691170304 |
---|---|
author | Bertsekas, Dimitri P. |
author2 | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
author_facet | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Bertsekas, Dimitri P. |
author_sort | Bertsekas, Dimitri P. |
collection | MIT |
description | Approximate policy iteration methods based
on temporal differences are popular in practice, and have
been tested extensively, dating to the early nineties, but
the associated convergence behavior is complex, and not
well understood at present. An important question is
whether the policy iteration process is seriously hampered
by oscillations between poor policies, roughly similar to
the attraction of gradient methods to poor local minima.
There has been little apparent concern in the approximate
DP/reinforcement learning literature about this possibility,
even though it has been documented with several simple
examples. Recent computational experimentation with the
game of tetris, a popular testbed for approximate DP
algorithms over a 15-year period, has brought the issue
to sharp focus. In particular, using a standard set of 22
features and temporal difference methods, an average score
of a few thousands was achieved. Using the same features
and a random search method, an overwhelmingly better
average score was achieved (600,000-900,000). The paper
explains the likely mechanism of this phenomenon, and
derives conditions under which it will not occur. |
first_indexed | 2024-09-23T11:27:21Z |
format | Article |
id | mit-1721.1/64641 |
institution | Massachusetts Institute of Technology |
language | en_US |
last_indexed | 2024-09-23T11:27:21Z |
publishDate | 2011 |
publisher | Institute of Electrical and Electronics Engineers |
record_format | dspace |
spelling | mit-1721.1/646412022-09-27T19:36:56Z Pathologies of Temporal Difference Methods in Approximate Dynamic Programming Bertsekas, Dimitri P. Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Bertsekas, Dimitri P. Bertsekas, Dimitri P. Approximate policy iteration methods based on temporal differences are popular in practice, and have been tested extensively, dating to the early nineties, but the associated convergence behavior is complex, and not well understood at present. An important question is whether the policy iteration process is seriously hampered by oscillations between poor policies, roughly similar to the attraction of gradient methods to poor local minima. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it has been documented with several simple examples. Recent computational experimentation with the game of tetris, a popular testbed for approximate DP algorithms over a 15-year period, has brought the issue to sharp focus. In particular, using a standard set of 22 features and temporal difference methods, an average score of a few thousands was achieved. Using the same features and a random search method, an overwhelmingly better average score was achieved (600,000-900,000). The paper explains the likely mechanism of this phenomenon, and derives conditions under which it will not occur. National Science Foundation (U.S.) (NSF Grant ECCS-0801549) United States. Air Force (Grant FA9550-10-1-0412) Los Alamos National Laboratory. Information Science and Technology Institute 2011-06-21T19:35:53Z 2011-06-21T19:35:53Z 2010-12 Article http://purl.org/eprint/type/ConferencePaper 0743-1546 0191-2216 http://hdl.handle.net/1721.1/64641 Bertsekas, Dimitri P. "Pathologies of Temporal Difference Methods in Approximate Dynamic Programming." In Proceedings of the 49th IEEE Conference on Decision and Control, Dec.15-17, 2010, Hilton Atlanta Hotel, Atlanta, Georgia USA. https://orcid.org/0000-0001-6909-7208 en_US Proceedings of the 49th IEEE Conference on Decision and Control (CDC), 2010 Creative Commons Attribution-Noncommercial-Share Alike 3.0 http://creativecommons.org/licenses/by-nc-sa/3.0/ application/pdf Institute of Electrical and Electronics Engineers MIT web domain |
spellingShingle | Bertsekas, Dimitri P. Pathologies of Temporal Difference Methods in Approximate Dynamic Programming |
title | Pathologies of Temporal Difference Methods in Approximate Dynamic Programming |
title_full | Pathologies of Temporal Difference Methods in Approximate Dynamic Programming |
title_fullStr | Pathologies of Temporal Difference Methods in Approximate Dynamic Programming |
title_full_unstemmed | Pathologies of Temporal Difference Methods in Approximate Dynamic Programming |
title_short | Pathologies of Temporal Difference Methods in Approximate Dynamic Programming |
title_sort | pathologies of temporal difference methods in approximate dynamic programming |
url | http://hdl.handle.net/1721.1/64641 https://orcid.org/0000-0001-6909-7208 |
work_keys_str_mv | AT bertsekasdimitrip pathologiesoftemporaldifferencemethodsinapproximatedynamicprogramming |