Exploration and Exploitation Balanced Experience Replay

Experience replay reuses past experience to update the target policy and improves sample utilization, and it has become an important component of deep reinforcement learning. Prioritized experience replay builds on experience replay by sampling selectively so that samples are used more efficiently. Nevertheless, current prioritized experience replay methods reduce the diversity of the samples drawn from the experience buffer, which can cause the neural network to converge to a local optimum. To tackle this issue, a novel method named exploration and exploitation balanced experience replay (E3R) is proposed to balance exploration and exploitation. E3R jointly considers each sample's exploration utility and exploitation utility and samples according to the weighted sum of two similarities: the similarity between the actions of the behavior policy and the target policy in the same state, and the similarity between the current state and past states. E3R is combined with the policy-gradient algorithm soft actor-critic and the value-function algorithm deep Q-learning, and experiments are carried out on a suite of OpenAI Gym tasks. Experimental results show that, compared with conventional uniform random sampling and temporal-difference priority sampling, E3R achieves faster convergence and higher cumulative return.
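
The abstract describes the sampling rule only at a high level, so the following Python sketch is an illustration rather than the authors' implementation: it stores transitions in a buffer, scores each one by a weighted sum of a policy-action similarity and a state similarity, and samples in proportion to that score. The class name E3RBuffer, the Gaussian-kernel similarity measure, the weights w_exploit and w_explore, and the target_policy interface are assumptions made for this sketch, not details taken from the paper.

```python
# Illustrative E3R-style sampling: each stored transition is scored by a
# weighted sum of
#   (1) an exploitation term -- similarity between the action the target
#       policy would take in the stored state and the behavior action that
#       was actually stored, and
#   (2) an exploration term -- similarity between the agent's current state
#       and the stored state.
# The Gaussian-kernel similarity and the weighting scheme are assumptions
# for illustration; they are not taken from the paper.
import numpy as np


class E3RBuffer:
    def __init__(self, capacity, w_exploit=0.5, w_explore=0.5):
        self.capacity = capacity
        self.w_exploit = w_exploit   # weight of the policy-action similarity term
        self.w_explore = w_explore   # weight of the state similarity term
        self.storage = []            # list of (state, action, reward, next_state, done)

    def add(self, state, action, reward, next_state, done):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)      # drop the oldest transition
        self.storage.append((np.asarray(state), np.asarray(action),
                             reward, np.asarray(next_state), done))

    @staticmethod
    def _similarity(x, y, bandwidth=1.0):
        # Gaussian-kernel similarity in (0, 1]; an assumed stand-in for the
        # paper's similarity measures.
        return np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2))

    def sample(self, batch_size, current_state, target_policy):
        """Sample transitions with probability proportional to the weighted
        sum of the two similarity terms. `target_policy` maps a state to the
        action the current target policy would choose (assumed interface)."""
        scores = np.empty(len(self.storage))
        for i, (s, a, _, _, _) in enumerate(self.storage):
            exploit = self._similarity(target_policy(s), a)            # policy-action similarity
            explore = self._similarity(np.asarray(current_state), s)   # state similarity
            scores[i] = self.w_exploit * exploit + self.w_explore * explore
        probs = scores / scores.sum()
        idx = np.random.choice(len(self.storage), size=batch_size, p=probs)
        return [self.storage[i] for i in idx]
```

In practice the two weights control the exploration-exploitation trade-off, and the placeholder similarity measures would be replaced by whatever the paper defines for action and state similarity.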

Bibliographic Details

Main Authors: ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang
Affiliation: College of Computer Science, Sichuan University, Chengdu 610065, China
Format: Article
Language: Chinese
Published: Editorial office of Computer Science, 2022-05-01
Series: Jisuanji kexue (Computer Science), Vol. 49, No. 5, pp. 179-185
ISSN: 1002-137X
DOI: 10.11896/jsjkx.210300084
Subjects: reinforcement learning; experience replay; priority sampling; exploitation; exploration; soft actor-critic algorithm
Online Access: https://www.jsjkx.com/fileup/1002-137X/PDF/1002-137X-2022-49-5-179.pdf