Enhanced Off-Policy Reinforcement Learning With Focused Experience Replay


Bibliographic Details
Main Authors: Seung-Hyun Kong, I. Made Aswin Nahrendra, Dong-Hee Paek
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Subjects: Reinforcement learning; off-policy; actor-critic; experience replay; replay buffer
Online Access:https://ieeexplore.ieee.org/document/9444458/
collection DOAJ
description Utilizing the collected experience tuples in the replay buffer (RB) is the primary way of exploiting experiences in off-policy reinforcement learning (RL) algorithms, and the sampling scheme for the experience tuples in the RB is therefore critical for experience utilization. In this paper, we show that a widely used sampling scheme in off-policy RL suffers from inefficiency due to uneven sampling of experience tuples from the RB. In fact, the conventional uniform sampling of experience tuples in the RB causes severely unbalanced experience utilization, since experiences stored earlier in the RB are sampled with much higher frequency, especially in the early stage of learning. We mitigate this fundamental problem by employing a half-normal sampling probability window that allocates higher sampling probability to newer experiences in the RB. In addition, we propose general and local size adjustment schemes that determine the standard deviation of the half-normal sampling window, to enhance the learning speed and performance and to mitigate temporary performance degradation during training, respectively. For performance demonstration, we apply the proposed sampling technique to state-of-the-art off-policy RL algorithms and test it on various RL benchmarks such as the MuJoCo Gym tasks and the CARLA simulator. The proposed technique shows considerable improvement in learning speed and final performance, especially on tasks with large state and action spaces. Furthermore, the proposed sampling technique increases the stability of the considered RL algorithms, as verified by lower variance of the performance results across different random seeds for network initialization.
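
To make the sampling idea above concrete, here is a minimal Python sketch of a replay buffer whose sampling probability follows a half-normal window over experience age, written only from the abstract. The class name FocusedReplayBuffer, the ring-buffer layout, and the fixed sigma parameter are illustrative assumptions; the paper's general and local size adjustment schemes, which adapt sigma during training, are not reproduced here.

    # Sketch of a half-normal sampling window over a replay buffer.
    # This is an assumption-based illustration, not the authors' released code.
    import numpy as np

    class FocusedReplayBuffer:
        """Ring buffer whose sampling focuses on newer experiences via a half-normal window."""

        def __init__(self, capacity, sigma):
            self.capacity = capacity
            self.sigma = sigma               # std. dev. of the half-normal window, in "ages"
            self.storage = [None] * capacity
            self.size = 0
            self.next_idx = 0                # slot where the next experience will be written

        def add(self, experience):
            """Store one experience tuple, overwriting the oldest slot when full."""
            self.storage[self.next_idx] = experience
            self.next_idx = (self.next_idx + 1) % self.capacity
            self.size = min(self.size + 1, self.capacity)

        def sample(self, batch_size):
            """Sample a batch; newer experiences (smaller age) get higher probability."""
            ages = np.arange(self.size)                        # 0 = newest, size-1 = oldest
            weights = np.exp(-0.5 * (ages / self.sigma) ** 2)  # half-normal shape over age
            probs = weights / weights.sum()
            newest = (self.next_idx - 1) % self.size           # physical slot of the newest tuple
            positions = (newest - ages) % self.size            # physical slot for each age
            chosen = np.random.choice(positions, size=batch_size, p=probs)
            return [self.storage[i] for i in chosen]

    # Hypothetical usage: buf = FocusedReplayBuffer(capacity=100000, sigma=20000.0)
    # buf.add((state, action, reward, next_state, done)); batch = buf.sample(256)

A smaller sigma focuses sampling more sharply on recent experiences, while a very large sigma approaches uniform sampling; how sigma should be set and adapted is exactly what the paper's adjustment schemes address.
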
id doaj.art-79a39ba6debb44519b40eb48b79a0470
institution Directory Open Access Journal
issn 2169-3536
doi 10.1109/ACCESS.2021.3085142
citation IEEE Access, vol. 9, pp. 93152-93164, 2021-01-01 (IEEE document 9444458)
author Seung-Hyun Kong (https://orcid.org/0000-0002-4753-1998), The Cho Chun Shik Graduate School of Green Transportation, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
author I. Made Aswin Nahrendra (https://orcid.org/0000-0001-9515-7059), The Robotics Program, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
author Dong-Hee Paek (https://orcid.org/0000-0003-0008-3726), The Cho Chun Shik Graduate School of Green Transportation, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
title Enhanced Off-Policy Reinforcement Learning With Focused Experience Replay
topic Reinforcement learning
off-policy
actor-critic
experience replay
replay buffer
url https://ieeexplore.ieee.org/document/9444458/