Q-learning with nearest neighbors

Bibliographic Details
Main Authors: Shah, Devavrat, Xie, Qiaomin
Other Authors: Massachusetts Institute of Technology. Laboratory for Information and Decision Systems; Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science; Statistics and Data Science Center (Massachusetts Institute of Technology)
Format: Article
Language: English
Published: 2021
Online Access: https://hdl.handle.net/1721.1/137946

Description: © 2018 Curran Associates Inc. All rights reserved. We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path under an arbitrary policy of the system is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm to learn the optimal Q-function using a nearest neighbor regression method. As the main contribution, we provide a tight finite-sample analysis of the convergence rate. In particular, for MDPs with a d-dimensional state space and discount factor γ ∈ (0, 1), given an arbitrary sample path with "covering time" L, we establish that the algorithm is guaranteed to output an ε-accurate estimate of the optimal Q-function using Õ(L/(ε^3 (1 - γ)^7)) samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as Õ(1/ε^d), so the sample complexity scales as Õ(1/ε^(d+3)). Indeed, we establish a lower bound showing that a dependence of Ω̃(1/ε^(d+2)) is necessary.

Citation: Shah, Devavrat and Xie, Qiaomin. 2018. "Q-learning with nearest neighbors."
Published in: Neural Information Processing Systems (NIPS), 2018
Type: Conference Paper
Publisher URL: https://papers.nips.cc/paper/7574-q-learning-with-nearest-neighbors
File Format: application/pdf
Terms of Use: Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.
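
The abstract describes the Nearest Neighbor Q-Learning (NNQL) algorithm only at a high level. Below is a minimal Python sketch of the general idea (online Q-learning along a single sample path, with Q-values stored at a fixed grid of anchor states and read off by nearest neighbor lookup), under purely illustrative assumptions: a toy 1-D state space, two actions, a made-up reward and transition kernel, and a purely random behavior policy. It is not the paper's implementation, and the names (anchors, step, nearest) are hypothetical.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): 1-D state space [0, 1],
# two actions, a made-up reward and transition kernel.
rng = np.random.default_rng(0)
gamma = 0.9                            # discount factor
anchors = np.linspace(0.0, 1.0, 51)    # fixed anchor states for nearest neighbor lookup
n_actions = 2
Q = np.zeros((len(anchors), n_actions))
visits = np.zeros((len(anchors), n_actions))

def reward(s, a):
    return s                           # toy reward: larger states are better

def step(s, a):
    # toy transition: drift left/right plus noise, clipped to [0, 1]
    drift = 0.05 if a == 1 else -0.05
    return float(np.clip(s + drift + 0.02 * rng.standard_normal(), 0.0, 1.0))

def nearest(s):
    return int(np.argmin(np.abs(anchors - s)))   # index of the closest anchor state

# Single sample path under a purely random behavior policy.
s = 0.5
for _ in range(100_000):
    a = int(rng.integers(n_actions))
    s_next = step(s, a)
    i = nearest(s)
    visits[i, a] += 1
    alpha = 1.0 / visits[i, a]         # decaying step size per (anchor, action) pair
    # Nearest neighbor lookup of the next-state value, then a standard
    # Q-learning update applied to the anchor closest to the observed state.
    target = reward(s, a) + gamma * Q[nearest(s_next)].max()
    Q[i, a] += alpha * (target - Q[i, a])
    s = s_next

print("greedy action at every 10th anchor:", Q.argmax(axis=1)[::10])
```

The paper builds its estimate from nearest neighbor regression over all samples observed near each anchor and organizes the analysis around covering-time epochs; the sketch above collapses this to the simplest single-neighbor, decaying-step-size update purely to show the data flow.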