Consistent Experience Replay in High-Dimensional Continuous Control with Decayed Hindsights
Main Author: | Xiaoyun Feng |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-09-01 |
Series: | Machines |
Subjects: | robotic control; goal-conditioned reinforcement learning; offline reinforcement learning; sparse rewards; experience replay; hindsight bias |
Online Access: | https://www.mdpi.com/2075-1702/10/10/856 |
author | Xiaoyun Feng |
collection | DOAJ |
description | The manipulation of complex robots is, in general, a high-dimensional continuous-control problem without an accurate dynamics model, which motivates the study and application of reinforcement learning (RL) algorithms. RL typically learns by maximizing the rewards accumulated through interaction with the environment. In practice, however, external rewards are not trivial to obtain, as they depend on either expert knowledge or domain priors. Recent advances in hindsight experience replay (HER) instead enable a robot to learn from automatically generated sparse, binary rewards that indicate whether it reaches the desired goal or a pseudo goal. However, HER inevitably introduces hindsight bias that skews the optimal control, since replays against achieved pseudo goals often differ from exploration toward the desired goals. To tackle this problem, we analyze the skewed objective and introduce decayed hindsight (DH), which enables consistent multi-goal experience replay by countering the bias between exploration and hindsight replay. We implement DH for goal-conditioned RL in both online and offline settings. Experiments on online robotic control tasks demonstrate that DH achieves the best average performance and is competitive with state-of-the-art replay strategies; experiments on offline robotic control tasks show that DH substantially improves the ability to extract near-optimal policies from offline datasets. |
format | Article |
id | doaj.art-8c4c74cfdacd45a9aa7c5e7665c332f1 |
institution | Directory Open Access Journal |
issn | 2075-1702 |
language | English |
publishDate | 2022-09-01 |
publisher | MDPI AG |
series | Machines |
spelling | Xiaoyun Feng (Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China), "Consistent Experience Replay in High-Dimensional Continuous Control with Decayed Hindsights," Machines, vol. 10, no. 10, Article 856, 2022-09-01, MDPI AG, ISSN 2075-1702, doi:10.3390/machines10100856, https://www.mdpi.com/2075-1702/10/10/856 |
title | Consistent Experience Replay in High-Dimensional Continuous Control with Decayed Hindsights |
topic | robotic control; goal-conditioned reinforcement learning; offline reinforcement learning; sparse rewards; experience replay; hindsight bias |
url | https://www.mdpi.com/2075-1702/10/10/856 |
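The description field above outlines HER-style goal relabeling with sparse, binary rewards and a decayed weighting of hindsight replays. The following is a minimal Python sketch of that idea, assuming a geometric decay over the temporal distance to the sampled pseudo goal; the `Transition` layout, the `sparse_reward` tolerance, and the `decay` schedule are illustrative assumptions, not the paper's actual formulation, which derives its correction from the analyzed skewed objective.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Transition:
    state: List[float]
    action: List[float]
    next_state: List[float]
    achieved_goal: List[float]  # goal actually reached at next_state
    desired_goal: List[float]   # goal the agent pursued during exploration


def sparse_reward(achieved_goal, goal, tol=0.05):
    """Binary sparse reward: 0 if the achieved goal is within tolerance of the target, else -1."""
    dist = sum((a - g) ** 2 for a, g in zip(achieved_goal, goal)) ** 0.5
    return 0.0 if dist < tol else -1.0


def relabel_with_decay(episode: List[Transition], k: int = 4, decay: float = 0.9):
    """HER-style 'future' relabeling with a hypothetical decay weight on hindsight samples.

    Each relabeled sample carries a weight that shrinks the further the pseudo goal lies
    from the current transition, so hindsight replays contribute less than replays against
    the truly desired goal (weight 1.0). Illustration only; not the authors' exact scheme.
    """
    samples = []
    for t, tr in enumerate(episode):
        # Original sample against the desired goal, at full weight.
        reward = sparse_reward(tr.achieved_goal, tr.desired_goal)
        samples.append((tr, tr.desired_goal, reward, 1.0))
        # k hindsight samples whose pseudo goals are future achieved goals of this episode.
        for _ in range(k):
            idx = random.randint(t, len(episode) - 1)
            pseudo_goal = episode[idx].achieved_goal
            reward = sparse_reward(tr.achieved_goal, pseudo_goal)
            weight = decay ** (idx - t)  # hindsight weight decays with temporal distance
            samples.append((tr, pseudo_goal, reward, weight))
    return samples
```

In such a sketch, the returned weight would scale each sample's contribution to the value or critic update, so that replays against achieved pseudo goals do not dominate replays against the originally desired goals.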