Self-Adaptive Priority Correction for Prioritized Experience Replay

Bibliographic Details
Main Authors: Hongjie Zhang, Cheng Qu, Jindou Zhang, Jing Li
Format: Article
Language: English
Published: MDPI AG, 2020-10-01
Series: Applied Sciences
ISSN: 2076-3417
DOI: 10.3390/app10196925
Affiliation: School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
Subjects: deep reinforcement learning; experience replay; importance sampling; DDQN; DDPG
Online Access: https://www.mdpi.com/2076-3417/10/19/6925
Description
Deep Reinforcement Learning (DRL) is a promising approach toward general artificial intelligence, but most DRL methods suffer from data inefficiency. To alleviate this problem, DeepMind proposed Prioritized Experience Replay (PER). Although PER improves data utilization, the priorities of most samples in its Experience Memory (EM) become stale, because only the priorities of a small fraction of the data are refreshed while the Q-network parameters are updated. Consequently, the gap between the stored and the real priority distributions grows over time, which biases the gradients of Deep Q-Learning (DQL) and pushes its updates in a non-ideal direction. In this work, we propose a novel self-adaptive priority correction algorithm named Importance-PER (Imp-PER) to correct this deviation. Specifically, we predict the sum of the real Temporal-Difference errors (TD-errors) of all data in the EM. Sampled data are corrected by an importance weight estimated from this predicted sum and the real TD-error computed by the latest agent. To control the unbounded importance weight, we use truncated importance sampling with a self-adaptive truncation threshold. Experiments on various Atari 2600 games with Double Deep Q-Network and on MuJoCo tasks with Deep Deterministic Policy Gradient demonstrate that Imp-PER improves data utilization and final policy quality on both discrete-state and continuous-state tasks without increasing the computational cost.
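
As a rough illustration of the idea summarized in the description, the sketch below pairs a standard proportional PER buffer with an importance-weight correction: each sampled transition is reweighted by the ratio of its recomputed ("real") priority share to its stored priority share, and the weight is truncated. The class name `ImpPERBuffer`, the helper `corrected_weights`, the `predicted_total` argument, and the fixed truncation constant are illustrative placeholders, not the authors' implementation; in particular, the paper's predictor of the total real TD-error and its self-adaptive truncation schedule are not reproduced here.

```python
# Minimal sketch of priority correction for PER, under the assumptions stated above.
import numpy as np


class ImpPERBuffer:
    """Proportional replay buffer with an importance-weight correction.

    Stored priorities go stale between updates; the correction reweights each
    sampled transition by (recomputed priority share) / (stored priority share),
    then truncates the weight. This is an illustrative approximation of Imp-PER,
    not the published algorithm.
    """

    def __init__(self, capacity, alpha=0.6, truncation=3.0):
        self.capacity = capacity
        self.alpha = alpha              # priority exponent, as in standard PER
        self.truncation = truncation    # fixed stand-in for the self-adaptive threshold
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float64)  # stored (|TD-error| + eps)^alpha
        self.pos = 0

    def add(self, transition, td_error):
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = (abs(td_error) + 1e-6) ** self.alpha
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        n = len(self.data)
        probs = self.priorities[:n] / self.priorities[:n].sum()
        idx = np.random.choice(n, size=batch_size, p=probs)
        return idx, [self.data[i] for i in idx], probs[idx]

    def corrected_weights(self, stored_probs, real_td_errors, predicted_total):
        """Importance weights ~ real priority share / stored priority share.

        `real_td_errors` are recomputed by the latest agent for the sampled batch
        only; `predicted_total` stands in for a prediction of the total real
        priority mass of the whole memory (the paper predicts the sum of real
        TD-errors with a learned estimator, which is not reproduced here).
        """
        real_priorities = (np.abs(real_td_errors) + 1e-6) ** self.alpha
        real_probs = real_priorities / predicted_total
        w = real_probs / stored_probs
        return np.minimum(w, self.truncation)   # truncated importance sampling


if __name__ == "__main__":
    # Toy usage with synthetic TD-errors standing in for agent outputs.
    buf = ImpPERBuffer(capacity=1000)
    for t in range(1000):
        buf.add(transition={"step": t}, td_error=np.random.randn())
    idx, batch, stored_probs = buf.sample(32)
    fresh_td = np.random.randn(32)                                  # stand-in for latest-agent TD-errors
    predicted_total = np.sum((np.abs(np.random.randn(1000)) + 1e-6) ** 0.6)  # stand-in prediction
    weights = buf.corrected_weights(stored_probs, fresh_td, predicted_total)
    print(weights[:5])
```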