On the Convergence of Stochastic Iterative Dynamic Programming Algorithms

Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(lambda) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(lambda) and Q-learning belong.

Bibliographic Details
Main Authors: Jaakkola, Tommi; Jordan, Michael I.; Singh, Satinder P.
Language: en_US
Published: 2004
Subjects: reinforcement learning; stochastic approximation; convergence; dynamic programming
Online Access: http://hdl.handle.net/1721.1/7205
author Jaakkola, Tommi
Jordan, Michael I.
Singh, Satinder P.
collection MIT
description Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(lambda) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(lambda) and Q-learning belong.
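The description above refers to Watkins's Q-learning as one of the DP-based learning algorithms covered by the paper's convergence theorem. For orientation, the following is a minimal sketch of the standard tabular Q-learning update on a toy chain MDP; the environment, the uniformly random behaviour policy, and the 1/n step-size schedule are illustrative assumptions and are not taken from the paper.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch (Watkins-style update).
# The toy chain MDP, the random behaviour policy, and the 1/n step-size
# schedule are illustrative assumptions, not taken from the paper.

GAMMA = 0.9         # discount factor
N_STATES = 5        # states 0..4; reaching state 4 ends an episode
ACTIONS = (-1, +1)  # move left or right along the chain

def step(state, action):
    """Deterministic toy chain: reward 1 only when the last state is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

Q = defaultdict(float)     # Q[(state, action)] -> current value estimate
visits = defaultdict(int)  # visit counts, used for the decreasing step sizes

for episode in range(2000):
    state = 0
    for _ in range(200):  # cap episode length so the toy run always terminates
        # Q-learning is off-policy: acting uniformly at random still allows the
        # greedy policy with respect to Q to be learned, provided every
        # state-action pair keeps being visited.
        action = random.choice(ACTIONS)
        next_state, reward, done = step(state, action)

        # Watkins update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        visits[(state, action)] += 1
        alpha = 1.0 / visits[(state, action)]  # decreasing step size
        target = reward if done else reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (target - Q[(state, action)])

        if done:
            break
        state = next_state

greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print("greedy action per state:", greedy)  # expected: +1 (move right) everywhere
```

The decreasing step sizes mirror the stochastic-approximation conditions (step sizes summing to infinity with summable squares) under which iterations of this kind are typically shown to converge.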
id mit-1721.1/7205
institution Massachusetts Institute of Technology
language en_US
publishDate 2004
spelling mit-1721.1/7205 (issued 1993-08-01; report numbers AIM-1441, CBCL-084; 15 p.; application/pdf)
title On the Convergence of Stochastic Iterative Dynamic Programming Algorithms
topic reinforcement learning
stochastic approximation
convergence
dynamic programming
url http://hdl.handle.net/1721.1/7205