Consistent Experience Replay in High-Dimensional Continuous Control with Decayed Hindsights

The manipulation of complex robots, which generally involves high-dimensional continuous control without an accurate dynamic model, calls for the study and application of reinforcement learning (RL) algorithms. Typically, RL learns with the objective of maximizing the accumulated rewards from interactions with the environment. In practice, external rewards are not trivial to specify, as they depend on either expert knowledge or domain priors. Recent advances in hindsight experience replay (HER) instead enable a robot to learn from automatically generated sparse and binary rewards that indicate whether it reaches the desired goals or pseudo goals. However, HER inevitably introduces hindsight bias that skews the optimal control, since replays against the achieved pseudo goals may often differ from exploration of the desired goals. To tackle this problem, we analyze the skewed objective and introduce decayed hindsight (DH), which enables consistent multi-goal experience replay by countering the bias between exploration and hindsight replay. We implement DH for goal-conditioned RL in both online and offline settings. Experiments on online robotic control tasks demonstrate that DH achieves the best average performance and is competitive with state-of-the-art replay strategies. Experiments on offline robotic control tasks show that DH substantially improves the ability to extract near-optimal policies from offline datasets.


Bibliographic Details
Main Author: Xiaoyun Feng (Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China)
Format: Article
Language: English
Published: MDPI AG, 2022-09-01
Series: Machines
ISSN: 2075-1702
DOI: 10.3390/machines10100856
Subjects: robotic control; goal-conditioned reinforcement learning; offline reinforcement learning; sparse rewards; experience replay; hindsight bias
Online Access: https://www.mdpi.com/2075-1702/10/10/856
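
Note: the abstract describes HER-style relabeling against achieved pseudo goals with sparse, binary rewards, and the idea of decaying the influence of hindsight replays to counter hindsight bias. The following Python sketch only illustrates that general pattern; the reward convention, the exponential decay schedule, and all function and parameter names are assumptions made here for illustration, not the paper's exact decayed-hindsight formulation.

import numpy as np

def sparse_reward(achieved_goal, goal, tol=0.05):
    # Binary goal-conditioned reward: 0 if the achieved goal is within
    # tolerance of the target goal, -1 otherwise (a common convention).
    return 0.0 if np.linalg.norm(achieved_goal - goal) < tol else -1.0

def relabel_with_decay(episode, decay=0.99):
    """Illustrative HER-style 'final' relabeling with decayed hindsight weights.

    episode: list of dicts with keys 'obs', 'action', 'achieved_goal',
    'desired_goal'. Returns transitions paired with a replay weight:
    originals against the desired goal keep weight 1.0, while copies
    relabeled to the final achieved goal get a weight that decays with
    their distance (in steps) from the end of the episode.
    """
    final_goal = episode[-1]['achieved_goal']
    T = len(episode)
    out = []
    for t, tr in enumerate(episode):
        # Original transition, evaluated against the desired goal.
        out.append({**tr,
                    'goal': tr['desired_goal'],
                    'reward': sparse_reward(tr['achieved_goal'], tr['desired_goal']),
                    'weight': 1.0})
        # Hindsight copy, relabeled to the achieved final goal, down-weighted.
        out.append({**tr,
                    'goal': final_goal,
                    'reward': sparse_reward(tr['achieved_goal'], final_goal),
                    'weight': decay ** (T - 1 - t)})
    return out

In a learner, the per-transition weight would typically scale the loss of the relabeled samples so that hindsight replays contribute less than replays of the originally desired goals.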