Exploration and Exploitation Balanced Experience Replay
Experience replay can reuse past experience to update the target policy and improve sample utilization, which has made it an important component of deep reinforcement learning. Prioritized experience replay performs selective sampling on top of experience replay to use samples more efficiently. Never...
Main Author: | ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang |
---|---|
Format: | Article |
Language: | zho |
Published: | Editorial office of Computer Science, 2022-05-01 |
Series: | Jisuanji kexue |
Subjects: | reinforcement learning; experience replay; priority sampling; exploitation; exploration; soft actor-critic algorithm |
Online Access: | https://www.jsjkx.com/fileup/1002-137X/PDF/1002-137X-2022-49-5-179.pdf |
_version_ | 1797845103969042432 |
---|---|
author | ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang |
author_facet | ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang |
author_sort | ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang |
collection | DOAJ |
description | Experience replay can reuse past experience to update the target policy and improve sample utilization, which has made it an important component of deep reinforcement learning. Prioritized experience replay performs selective sampling on top of experience replay to use samples more efficiently. Nevertheless, current prioritized experience replay methods reduce the diversity of the samples drawn from the experience buffer, causing the neural network to converge to a local optimum. To tackle this issue, a novel method named exploration and exploitation balanced experience replay (E3R) is proposed to balance exploration and exploitation. This method comprehensively considers the exploration utility and exploitation utility of the samples, and samples according to the weighted sum of two similarities: one is the similarity between the actions of the behavior policy and the target policy in the same state, and the other is the similarity between the current state and past states. Besides, E3R is combined with the policy gradient algorithm soft actor-critic and the value function algorithm deep Q-learning, and experiments are carried out on the suite of OpenAI Gym tasks. Experimental results show that, compared with traditional random sampling and temporal-difference priority sampling, E3R achieves faster convergence and higher cumulative return. |
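The sampling rule described in the abstract (a weighted sum of a policy-action similarity and a state similarity) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the weight `alpha`, the use of `1 - state_similarity` as the exploration term, and the normalization into probabilities are all assumptions, since the abstract does not specify the exact similarity measures or weighting.

```python
import numpy as np

def e3r_priorities(policy_sims, state_sims, alpha=0.5):
    """Combine exploitation utility (similarity between the stored action and
    the target policy's action in the same state) with exploration utility
    (dissimilarity between the stored state and recently visited states) into
    normalized sampling probabilities. `alpha` trades off the two terms."""
    policy_sims = np.asarray(policy_sims, dtype=float)
    state_sims = np.asarray(state_sims, dtype=float)
    # Low similarity to past states => high exploration utility.
    scores = alpha * policy_sims + (1.0 - alpha) * (1.0 - state_sims)
    scores = np.clip(scores, 1e-8, None)  # keep probabilities strictly positive
    return scores / scores.sum()

def sample_indices(rng, priorities, batch_size):
    """Draw a minibatch of buffer indices according to the priorities."""
    return rng.choice(len(priorities), size=batch_size,
                      replace=False, p=priorities)
```

Under this sketch, a transition whose stored action closely matches the current target policy and whose state is unlike recently visited states receives a high sampling probability, which is one plausible reading of how E3R balances exploitation against sample diversity.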
first_indexed | 2024-04-09T17:33:08Z |
format | Article |
id | doaj.art-ff5d324d720443a899e77bc19f2a25d2 |
institution | Directory Open Access Journal |
issn | 1002-137X |
language | zho |
last_indexed | 2024-04-09T17:33:08Z |
publishDate | 2022-05-01 |
publisher | Editorial office of Computer Science |
record_format | Article |
series | Jisuanji kexue |
spelling | doaj.art-ff5d324d720443a899e77bc19f2a25d2 | 2023-04-18T02:35:57Z | zho | Editorial office of Computer Science | Jisuanji kexue | 1002-137X | 2022-05-01 | Vol. 49, No. 5, pp. 179-185 | 10.11896/jsjkx.210300084 | Exploration and Exploitation Balanced Experience Replay | ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang (College of Computer Science, Sichuan University, Chengdu 610065, China) | https://www.jsjkx.com/fileup/1002-137X/PDF/1002-137X-2022-49-5-179.pdf | reinforcement learning|experience replay|priority sampling|exploitation|exploration|soft actor-critic algorithm |
spellingShingle | ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang Exploration and Exploitation Balanced Experience Replay Jisuanji kexue reinforcement learning|experience replay|priority sampling|exploitation|exploration|soft actor-critic algorithm |
title | Exploration and Exploitation Balanced Experience Replay |
title_full | Exploration and Exploitation Balanced Experience Replay |
title_fullStr | Exploration and Exploitation Balanced Experience Replay |
title_full_unstemmed | Exploration and Exploitation Balanced Experience Replay |
title_short | Exploration and Exploitation Balanced Experience Replay |
title_sort | exploration and exploitation balanced experience replay |
topic | reinforcement learning|experience replay|priority sampling|exploitation|exploration|soft actor-critic algorithm |
url | https://www.jsjkx.com/fileup/1002-137X/PDF/1002-137X-2022-49-5-179.pdf |
work_keys_str_mv | AT zhangjianenglihuiwuhaolinwangzhuang explorationandexploitationbalancedexperiencereplay |