Exploration and Exploitation Balanced Experience Replay
Experience replay can reuse past experience to update the target policy and improve sample utilization, which has made it an important component of deep reinforcement learning. Prioritized experience replay performs selective sampling on top of experience replay to use samples more efficiently. Never...
Main Author: | ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang |
---|---|
Format: | Article |
Language: | zho |
Published: | Editorial office of Computer Science, 2022-05-01 |
Series: | Jisuanji kexue |
Subjects: | reinforcement learning; experience replay; priority sampling; exploitation; exploration; soft actor-critic algorithm |
Online Access: | https://www.jsjkx.com/fileup/1002-137X/PDF/1002-137X-2022-49-5-179.pdf |
_version_ | 1797845103969042432 |
---|---|
author | ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang |
author_facet | ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang |
author_sort | ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang |
collection | DOAJ |
description | Experience replay can reuse past experience to update the target policy and improve sample utilization, which has made it an important component of deep reinforcement learning. Prioritized experience replay performs selective sampling on top of experience replay to use samples more efficiently. Nevertheless, current prioritized experience replay methods reduce the diversity of the samples drawn from the experience buffer, causing the neural network to converge to a local optimum. To tackle this issue, a novel method named exploration and exploitation balanced experience replay (E3R) is proposed to balance exploration and exploitation. This method comprehensively considers the exploration utility and exploitation utility of the samples, and samples according to the weighted sum of two similarities: one is the similarity between the actions of the behavior policy and the target policy in the same state, and the other is the similarity between the current state and past states. Besides, E3R is combined with the policy gradient algorithm soft actor-critic and the value function algorithm deep Q-learning, and experiments are carried out on the suite of OpenAI Gym tasks. Experimental results show that, compared with traditional random sampling and temporal-difference priority sampling, E3R achieves faster convergence and higher cumulative return. |
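The sampling rule described in the abstract (a weighted sum of a policy-action similarity and a state similarity) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the weight `alpha`, the use of `1 - state_similarity` as the exploration term, and the normalization into probabilities are all assumptions, since the abstract does not specify the exact similarity measures or weighting.

```python
import numpy as np

def e3r_priorities(policy_sims, state_sims, alpha=0.5):
    """Combine exploitation utility (similarity between the stored action and
    the target policy's action in the same state) with exploration utility
    (dissimilarity between the stored state and recently visited states) into
    normalized sampling probabilities. `alpha` trades off the two terms."""
    policy_sims = np.asarray(policy_sims, dtype=float)
    state_sims = np.asarray(state_sims, dtype=float)
    # Low similarity to past states => high exploration utility.
    scores = alpha * policy_sims + (1.0 - alpha) * (1.0 - state_sims)
    scores = np.clip(scores, 1e-8, None)  # keep probabilities strictly positive
    return scores / scores.sum()

def sample_indices(rng, priorities, batch_size):
    """Draw a minibatch of buffer indices according to the priorities."""
    return rng.choice(len(priorities), size=batch_size,
                      replace=False, p=priorities)
```

Under this sketch, a transition whose stored action closely matches the current target policy and whose state is unlike recently visited states receives a high sampling probability, which is one plausible reading of how E3R balances exploitation against sample diversity.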
first_indexed | 2024-04-09T17:33:08Z |
format | Article |
id | doaj.art-ff5d324d720443a899e77bc19f2a25d2 |
institution | Directory Open Access Journal |
issn | 1002-137X |
language | zho |
last_indexed | 2024-04-09T17:33:08Z |
publishDate | 2022-05-01 |
publisher | Editorial office of Computer Science |
record_format | Article |
series | Jisuanji kexue |
spelling | doaj.art-ff5d324d720443a899e77bc19f2a25d2 | 2023-04-18T02:35:57Z | zho | Editorial office of Computer Science | Jisuanji kexue | 1002-137X | 2022-05-01 | Vol. 49, No. 5, pp. 179-185 | 10.11896/jsjkx.210300084 | Exploration and Exploitation Balanced Experience Replay | ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang (College of Computer Science, Sichuan University, Chengdu 610065, China) | https://www.jsjkx.com/fileup/1002-137X/PDF/1002-137X-2022-49-5-179.pdf | reinforcement learning|experience replay|priority sampling|exploitation|exploration|soft actor-critic algorithm |
spellingShingle | ZHANG Jia-neng, LI Hui, WU Hao-lin, WANG Zhuang Exploration and Exploitation Balanced Experience Replay Jisuanji kexue reinforcement learning|experience replay|priority sampling|exploitation|exploration|soft actor-critic algorithm |
title | Exploration and Exploitation Balanced Experience Replay |
title_full | Exploration and Exploitation Balanced Experience Replay |
title_fullStr | Exploration and Exploitation Balanced Experience Replay |
title_full_unstemmed | Exploration and Exploitation Balanced Experience Replay |
title_short | Exploration and Exploitation Balanced Experience Replay |
title_sort | exploration and exploitation balanced experience replay |
topic | reinforcement learning|experience replay|priority sampling|exploitation|exploration|soft actor-critic algorithm |
url | https://www.jsjkx.com/fileup/1002-137X/PDF/1002-137X-2022-49-5-179.pdf |
work_keys_str_mv | AT zhangjianenglihuiwuhaolinwangzhuang explorationandexploitationbalancedexperiencereplay |