Scaling Up Q-Learning via Exploiting State–Action Equivalence

Recent success stories in reinforcement learning have demonstrated that leveraging structural properties of the underlying environment is key to devising viable methods capable of solving complex tasks. We study off-policy learning in discounted reinforcement learning, in settings where an equivalence relation exists in the environment. We introduce a new model-free algorithm, QL-ES (Q-learning with equivalence structure), a variant of (asynchronous) Q-learning tailored to exploit the equivalence structure in the MDP. We establish a non-asymptotic PAC-type sample complexity bound for QL-ES, thereby demonstrating its sample efficiency. This bound also allows us to analytically quantify the improvement of QL-ES over Q-learning, showing that the theoretical gain in some domains can be massive. We report extensive numerical experiments demonstrating that QL-ES converges significantly faster than (structure-oblivious) Q-learning, implying that the empirical performance gain from exploiting the equivalence structure can be substantial even in simple domains. To the best of our knowledge, QL-ES is the first provably efficient model-free algorithm to exploit the equivalence structure in finite MDPs.
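The abstract describes QL-ES only at a high level. As an illustration of the general idea of sharing value updates across equivalent state–action pairs, here is a minimal sketch in Python; the class name, interface, and hyperparameters are assumptions made for illustration, and the sketch is not a reproduction of the QL-ES algorithm or its analysis.

    # Minimal, hypothetical sketch (not the paper's QL-ES): tabular Q-learning in which
    # state-action pairs declared equivalent share a single Q entry, so one observed
    # transition updates all of them at once.
    import random
    from collections import defaultdict

    class EquivalenceQLearner:
        def __init__(self, n_actions, classes, gamma=0.9, alpha=0.1, epsilon=0.1):
            # classes: dict mapping (state, action) -> class id; pairs missing from the
            # dict are treated as singleton classes.
            self.n_actions = n_actions
            self.classes = classes
            self.gamma, self.alpha, self.epsilon = gamma, alpha, epsilon
            self.q = defaultdict(float)  # one value per equivalence class

        def _key(self, s, a):
            return self.classes.get((s, a), (s, a))

        def value(self, s, a):
            return self.q[self._key(s, a)]

        def act(self, s):
            # epsilon-greedy behaviour policy (off-policy learning, as in Q-learning)
            if random.random() < self.epsilon:
                return random.randrange(self.n_actions)
            return max(range(self.n_actions), key=lambda a: self.value(s, a))

        def update(self, s, a, r, s_next):
            # Standard Q-learning TD target, applied to the shared class entry.
            best_next = max(self.value(s_next, b) for b in range(self.n_actions))
            key = self._key(s, a)
            self.q[key] += self.alpha * (r + self.gamma * best_next - self.q[key])

Because equivalent pairs share one entry, a single sample improves the estimate for every pair in its class, which is the intuitive source of the sample-efficiency gain the abstract attributes to exploiting the equivalence structure.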

Bibliographic Details
Main Authors: Yunlian Lyu (Department of Computer Science, University of Copenhagen), Aymeric Côme (Inria Rennes, Bretagne Atlantique, Campus Universitaire de Beaulieu), Yijie Zhang (Department of Computer Science, University of Copenhagen), Mohammad Sadegh Talebi (Department of Computer Science, University of Copenhagen)
Format: Article
Language: English
Published: MDPI AG, 2023-03-01
Series: Entropy, volume 25, issue 4, article 584
ISSN: 1099-4300
DOI: 10.3390/e25040584
Subjects: reinforcement learning; Markov decision process; Q-learning; equivalence structure
Online Access: https://www.mdpi.com/1099-4300/25/4/584