Consistent Experience Replay in High-Dimensional Continuous Control with Decayed Hindsights

The manipulation of complex robots, which generally involves high-dimensional continuous control without an accurate dynamic model, calls for the study and application of reinforcement learning (RL) algorithms. Typically, RL learns with the objective of maximizing the accumulated rewards from interactions with the environment. In practice, external rewards are not trivial to specify, as they depend on either expert knowledge or domain priors. Recent advances in hindsight experience replay (HER) instead enable a robot to learn from automatically generated sparse and binary rewards that indicate whether it reaches the desired goals or pseudo goals. However, HER inevitably introduces hindsight bias that skews the optimal control, since replays against the achieved pseudo goals may often differ from exploration of the desired goals. To tackle this problem, we analyze the skewed objective and introduce decayed hindsight (DH), which enables consistent multi-goal experience replay by countering the bias between exploration and hindsight replay. We implement DH for goal-conditioned RL in both online and offline settings. Experiments on online robotic control tasks demonstrate that DH achieves the best average performance and is competitive with state-of-the-art replay strategies. Experiments on offline robotic control tasks show that DH substantially improves the ability to extract near-optimal policies from offline datasets.


Bibliographic Details
Main Author: Xiaoyun Feng (Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China)
Format: Article
Language: English
Published: MDPI AG, 2022-09-01
Series: Machines
ISSN: 2075-1702
DOI: 10.3390/machines10100856
Subjects: robotic control; goal-conditioned reinforcement learning; offline reinforcement learning; sparse rewards; experience replay; hindsight bias
Online Access: https://www.mdpi.com/2075-1702/10/10/856
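
Note: the abstract describes HER-style relabeling against achieved pseudo goals with sparse, binary rewards, and the idea of decaying the influence of hindsight replays to counter hindsight bias. The following Python sketch only illustrates that general pattern; the reward convention, the exponential decay schedule, and all function and parameter names are assumptions made here for illustration, not the paper's exact decayed-hindsight formulation.

import numpy as np

def sparse_reward(achieved_goal, goal, tol=0.05):
    # Binary goal-conditioned reward: 0 if the achieved goal is within
    # tolerance of the target goal, -1 otherwise (a common convention).
    return 0.0 if np.linalg.norm(achieved_goal - goal) < tol else -1.0

def relabel_with_decay(episode, decay=0.99):
    """Illustrative HER-style 'final' relabeling with decayed hindsight weights.

    episode: list of dicts with keys 'obs', 'action', 'achieved_goal',
    'desired_goal'. Returns transitions paired with a replay weight:
    originals against the desired goal keep weight 1.0, while copies
    relabeled to the final achieved goal get a weight that decays with
    their distance (in steps) from the end of the episode.
    """
    final_goal = episode[-1]['achieved_goal']
    T = len(episode)
    out = []
    for t, tr in enumerate(episode):
        # Original transition, evaluated against the desired goal.
        out.append({**tr,
                    'goal': tr['desired_goal'],
                    'reward': sparse_reward(tr['achieved_goal'], tr['desired_goal']),
                    'weight': 1.0})
        # Hindsight copy, relabeled to the achieved final goal, down-weighted.
        out.append({**tr,
                    'goal': final_goal,
                    'reward': sparse_reward(tr['achieved_goal'], final_goal),
                    'weight': decay ** (T - 1 - t)})
    return out

In a learner, the per-transition weight would typically scale the loss of the relabeled samples so that hindsight replays contribute less than replays of the originally desired goals.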