Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning
Training task-oriented dialog agents based on reinforcement learning is time-consuming and requires a large number of interactions with real users. Learning a dialog policy from limited dialog experience remains an obstacle that makes the agent training process less efficient. In addition, most...
Main Authors: | Xuecheng Niu, Akinori Ito, Takashi Nose |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2024-01-01 |
Series: | IEEE Access |
Subjects: | Dialog management; reinforcement learning; deep Dyna-Q; curiosity; curriculum learning |
Online Access: | https://ieeexplore.ieee.org/document/10468605/ |
author | Xuecheng Niu; Akinori Ito; Takashi Nose |
collection | DOAJ |
description | Training task-oriented dialog agents based on reinforcement learning is time-consuming and requires a large number of interactions with real users. Learning a dialog policy from limited dialog experience remains an obstacle that makes the agent training process less efficient. In addition, most previous frameworks start training by choosing training samples at random, which differs from how humans learn and hurts the efficiency and stability of training. Therefore, we propose Scheduled Curiosity-Deep Dyna-Q (SC-DDQ), a curiosity-driven curriculum learning framework based on a state-of-the-art model-based reinforcement learning dialog model, Deep Dyna-Q (DDQ). Furthermore, we designed learning schedules for SC-DDQ and DDQ, respectively, following two opposite training strategies: classic curriculum learning and its reverse version. Our results show that introducing scheduled learning and curiosity leads to a significant improvement over DDQ and Deep Q-learning (DQN). Surprisingly, we found that traditional curriculum learning was not always effective. Specifically, according to the experimental results, the easy-first and difficult-first strategies are more suitable for SC-DDQ and DDQ, respectively. To analyze our results, we adopted the entropy of sampled actions to characterize action exploration and found that training strategies with high entropy in the first stage and low entropy in the last stage lead to better performance. |
format | Article |
id | doaj.art-ce20b18eb08b46bd9cf33b813651fd2b |
institution | Directory of Open Access Journals |
issn | 2169-3536 |
language | English |
publishDate | 2024-01-01 |
publisher | IEEE |
series | IEEE Access |
spelling | IEEE Access, vol. 12, pp. 46940-46952, 2024-01-01. DOI: 10.1109/ACCESS.2024.3376418 (IEEE Xplore document 10468605). DOAJ record doaj.art-ce20b18eb08b46bd9cf33b813651fd2b, indexed 2024-04-04T23:00:36Z. Authors: Xuecheng Niu (https://orcid.org/0009-0006-8861-6273), Akinori Ito (https://orcid.org/0000-0002-8835-7877), and Takashi Nose, all with the Graduate School of Engineering, Tohoku University, Sendai, Japan. |
title | Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning |
topic | Dialog management; reinforcement learning; deep Dyna-Q; curiosity; curriculum learning |
url | https://ieeexplore.ieee.org/document/10468605/ |
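The description field above reports that action exploration was analyzed through the entropy of the agent's sampled actions, and that schedules with high entropy in the first training stage and low entropy in the last stage performed best. The snippet below is a minimal illustrative sketch of that measurement, not the authors' implementation: it computes the empirical entropy of logged dialog-action IDs for two stages, and the action logs, stage split, and function name are assumptions made for the example.

```python
# Minimal sketch: empirical entropy of sampled actions per training stage.
# The action IDs and stage logs below are hypothetical, for illustration only.
from collections import Counter
import math

def action_entropy(actions):
    """Return H = -sum_a p(a) * log p(a) for the actions sampled in one stage."""
    counts = Counter(actions)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical dialog-action IDs logged in an early and a late training stage.
early_stage = [0, 3, 7, 1, 5, 2, 7, 4, 6, 0, 3, 5]   # diverse actions -> high entropy
late_stage  = [2, 2, 2, 5, 2, 2, 5, 2, 2, 2, 2, 5]   # near-converged policy -> low entropy

print(f"early-stage entropy: {action_entropy(early_stage):.3f}")
print(f"late-stage entropy:  {action_entropy(late_stage):.3f}")
```

Run as-is, the early stage yields noticeably higher entropy than the late stage, mirroring the high-to-low entropy trajectory that the abstract associates with the better-performing training schedules.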