Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning

Bibliographic Details
Main Authors: Xuecheng Niu, Akinori Ito, Takashi Nose
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects: Dialog management; reinforcement learning; deep Dyna-Q; curiosity; curriculum learning
Online Access: https://ieeexplore.ieee.org/document/10468605/
_version_ 1827292872648425472
author Xuecheng Niu
Akinori Ito
Takashi Nose
author_facet Xuecheng Niu
Akinori Ito
Takashi Nose
author_sort Xuecheng Niu
collection DOAJ
description Training task-oriented dialog agents with reinforcement learning is time-consuming and requires a large number of interactions with real users. Learning an effective dialog policy from limited dialog experience remains an obstacle that makes agent training inefficient. In addition, most previous frameworks start training by choosing training samples at random, which differs from how humans learn and hurts the efficiency and stability of training. Therefore, we propose Scheduled Curiosity-Deep Dyna-Q (SC-DDQ), a curiosity-driven curriculum learning framework built on a state-of-the-art model-based reinforcement learning dialog model, Deep Dyna-Q (DDQ). Furthermore, we designed learning schedules for SC-DDQ and DDQ, respectively, following two opposite training strategies: classic curriculum learning and its reverse. Our results show that by introducing scheduled learning and curiosity, the new framework leads to a significant improvement over DDQ and Deep Q-learning (DQN). Surprisingly, we found that traditional curriculum learning is not always effective. Specifically, according to the experimental results, the easy-first and difficult-first strategies are more suitable for SC-DDQ and DDQ, respectively. To analyze our results, we adopted the entropy of sampled actions to characterize action exploration, and found that training strategies with high entropy in the first stage and low entropy in the last stage lead to better performance.
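The analysis metric mentioned in the description, the entropy of sampled actions, can be illustrated with a minimal sketch. The Python snippet below is not from the paper; the dialog-act names and sample logs are hypothetical. It computes the Shannon entropy of the empirical distribution of actions sampled during a training stage; higher entropy corresponds to broader action exploration, matching the high-entropy-first finding reported above.

    # Minimal sketch (assumed, not the authors' code): entropy of the empirical
    # distribution of actions sampled during one training stage.
    from collections import Counter
    import math

    def action_entropy(sampled_actions):
        """Shannon entropy (bits) of the empirical action distribution."""
        counts = Counter(sampled_actions)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Hypothetical logs: an early stage that tries many dialog acts vs. a late
    # stage that has converged on a few. The early stage yields higher entropy.
    early_stage = ["request", "inform", "confirm", "request", "deny", "inform"]
    late_stage = ["inform", "inform", "inform", "confirm", "inform", "inform"]
    print(action_entropy(early_stage))  # higher: broad exploration
    print(action_entropy(late_stage))   # lower: exploitation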
first_indexed 2024-04-24T13:14:21Z
format Article
id doaj.art-ce20b18eb08b46bd9cf33b813651fd2b
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-04-24T13:14:21Z
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-ce20b18eb08b46bd9cf33b813651fd2b
indexed 2024-04-04T23:00:36Z
language English
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2024-01-01
volume 12
pages 46940-46952
doi 10.1109/ACCESS.2024.3376418
ieee_document 10468605
title Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning
author Xuecheng Niu (https://orcid.org/0009-0006-8861-6273), Akinori Ito (https://orcid.org/0000-0002-8835-7877), Takashi Nose
affiliation Graduate School of Engineering, Tohoku University, Sendai, Japan
url https://ieeexplore.ieee.org/document/10468605/
topic Dialog management; reinforcement learning; deep Dyna-Q; curiosity; curriculum learning
spellingShingle Xuecheng Niu
Akinori Ito
Takashi Nose
Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning
IEEE Access
Dialog management
reinforcement learning
deep Dyna-Q
curiosity
curriculum learning
title Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning
title_full Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning
title_fullStr Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning
title_full_unstemmed Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning
title_short Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning
title_sort scheduled curiosity deep dyna q efficient exploration for dialog policy learning
topic Dialog management
reinforcement learning
deep Dyna-Q
curiosity
curriculum learning
url https://ieeexplore.ieee.org/document/10468605/
work_keys_str_mv AT xuechengniu scheduledcuriositydeepdynaqefficientexplorationfordialogpolicylearning
AT akinoriito scheduledcuriositydeepdynaqefficientexplorationfordialogpolicylearning
AT takashinose scheduledcuriositydeepdynaqefficientexplorationfordialogpolicylearning