Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning
Training task-oriented dialog agents based on reinforcement learning is time-consuming and requires a large number of interactions with real users. Learning a dialog policy from limited dialog experience remains an obstacle that makes the agent training process less efficient. In addition, most...
Main Authors: | Xuecheng Niu, Akinori Ito, Takashi Nose |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2024-01-01 |
Series: | IEEE Access |
Subjects: | Dialog management; reinforcement learning; deep Dyna-Q; curiosity; curriculum learning |
Online Access: | https://ieeexplore.ieee.org/document/10468605/ |
author | Xuecheng Niu; Akinori Ito; Takashi Nose |
collection | DOAJ |
description | Training task-oriented dialog agents based on reinforcement learning is time-consuming and requires a large number of interactions with real users. Learning a dialog policy from limited dialog experience remains an obstacle that makes the agent training process less efficient. In addition, most previous frameworks start training by choosing training samples at random, which differs from how humans learn and hurts the efficiency and stability of training. Therefore, we propose Scheduled Curiosity-Deep Dyna-Q (SC-DDQ), a curiosity-driven curriculum learning framework based on a state-of-the-art model-based reinforcement learning dialog model, Deep Dyna-Q (DDQ). Furthermore, we designed learning schedules for SC-DDQ and DDQ, respectively, following two opposite training strategies: classic curriculum learning and its reverse version. Our results show that introducing scheduled learning and curiosity leads to a significant improvement over DDQ and Deep Q-learning (DQN). Surprisingly, we found that traditional curriculum learning was not always effective. Specifically, according to the experimental results, the easy-first and difficult-first strategies are more suitable for SC-DDQ and DDQ, respectively. To analyze our results, we adopted the entropy of sampled actions to characterize action exploration and found that training strategies with high entropy in the first stage and low entropy in the last stage lead to better performance. |
format | Article |
id | doaj.art-ce20b18eb08b46bd9cf33b813651fd2b |
institution | Directory of Open Access Journals |
issn | 2169-3536 |
language | English |
publishDate | 2024-01-01 |
publisher | IEEE |
series | IEEE Access |
spelling | IEEE Access, vol. 12, pp. 46940-46952, 2024-01-01. DOI: 10.1109/ACCESS.2024.3376418 (IEEE Xplore document 10468605). DOAJ record doaj.art-ce20b18eb08b46bd9cf33b813651fd2b, indexed 2024-04-04T23:00:36Z. Authors: Xuecheng Niu (https://orcid.org/0009-0006-8861-6273), Akinori Ito (https://orcid.org/0000-0002-8835-7877), and Takashi Nose, all with the Graduate School of Engineering, Tohoku University, Sendai, Japan. |
title | Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning |
topic | Dialog management; reinforcement learning; deep Dyna-Q; curiosity; curriculum learning |
url | https://ieeexplore.ieee.org/document/10468605/ |
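The description field above reports that action exploration was analyzed through the entropy of the agent's sampled actions, and that schedules with high entropy in the first training stage and low entropy in the last stage performed best. The snippet below is a minimal illustrative sketch of that measurement, not the authors' implementation: it computes the empirical entropy of logged dialog-action IDs for two stages, and the action logs, stage split, and function name are assumptions made for the example.

```python
# Minimal sketch: empirical entropy of sampled actions per training stage.
# The action IDs and stage logs below are hypothetical, for illustration only.
from collections import Counter
import math

def action_entropy(actions):
    """Return H = -sum_a p(a) * log p(a) for the actions sampled in one stage."""
    counts = Counter(actions)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical dialog-action IDs logged in an early and a late training stage.
early_stage = [0, 3, 7, 1, 5, 2, 7, 4, 6, 0, 3, 5]   # diverse actions -> high entropy
late_stage  = [2, 2, 2, 5, 2, 2, 5, 2, 2, 2, 2, 5]   # near-converged policy -> low entropy

print(f"early-stage entropy: {action_entropy(early_stage):.3f}")
print(f"late-stage entropy:  {action_entropy(late_stage):.3f}")
```

Run as-is, the early stage yields noticeably higher entropy than the late stage, mirroring the high-to-low entropy trajectory that the abstract associates with the better-performing training schedules.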