Learning to Plan via Deep Optimistic Value Exploration

Deep exploration requires coordinated long-term planning. We present a model-based reinforcement learning algorithm that guides policy learning through a value function that exhibits optimism in the face of uncertainty. We capture uncertainty over values by combining predictions from an ensemble o...

Bibliographic Details
Main Authors:	Seyde, Tim, Schwarting, Wilko, Karaman, Sertac, Rus, Daniela L
Other Authors:	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format:	Article
Published:	2020
Online Access:	https://hdl.handle.net/1721.1/125161

Description:	Deep exploration requires coordinated long-term planning. We present a model-based reinforcement learning algorithm that guides policy learning through a value function that exhibits optimism in the face of uncertainty. We capture uncertainty over values by combining predictions from an ensemble of models and formulate an upper confidence bound (UCB) objective to recover optimistic estimates. Training the policy on ensemble rollouts with the learned value function as the terminal cost allows for projecting long-term interactions into a limited planning horizon, thus enabling deep optimistic exploration. We do not assume a priori knowledge of either the dynamics or reward function. We demonstrate that our approach can accommodate both dense and sparse reward signals, while improving sample complexity on a variety of benchmarking tasks.
Keywords:	Reinforcement Learning; Deep Exploration; Model-Based; Value Function; UCB
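As a rough illustration of the two ideas named in the abstract, the sketch below computes an optimistic value from an ensemble of value predictions and a horizon-limited return that uses a value estimate as the terminal term. It is not taken from the paper: the UCB form (ensemble mean plus `beta` times ensemble standard deviation), the function names `ucb_value` and `horizon_return`, and the parameters `beta` and `gamma` are all assumptions made for this example.

```python
import numpy as np


def ucb_value(ensemble_values, beta=1.0):
    """Optimistic value estimate (assumed form): ensemble mean + beta * ensemble std.

    ensemble_values has shape (n_members, n_states); the result has shape (n_states,).
    """
    values = np.asarray(ensemble_values, dtype=float)
    mean = values.mean(axis=0)   # average prediction across ensemble members
    std = values.std(axis=0)     # disagreement across members as an uncertainty proxy
    return mean + beta * std


def horizon_return(rewards, terminal_value, gamma=0.99):
    """Discounted return over a short rollout, closed off by a terminal value estimate."""
    rewards = np.asarray(rewards, dtype=float)
    horizon = len(rewards)
    discounts = gamma ** np.arange(horizon)
    return float(np.sum(discounts * rewards) + gamma ** horizon * terminal_value)


# Hypothetical numbers: two ensemble members, two states.
preds = np.array([[1.0, 0.5],
                  [3.0, 0.5]])
optimistic = ucb_value(preds, beta=1.0)

# Two-step rollout whose tail is summarized by a terminal value.
ret = horizon_return([1.0, 1.0], terminal_value=4.0, gamma=0.5)
```

In this toy setup the members disagree only on the first state, so only that state receives an optimism bonus; where the ensemble agrees, the optimistic value equals the mean.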
Other Affiliation:	Massachusetts Institute of Technology. Laboratory for Information and Decision Systems
Sponsors:	Office of Naval Research; Qualcomm; Toyota Research Institute
Dates:	2020-05-11T19:59:29Z; 2020-08; 2020-05
Type:	http://purl.org/eprint/type/ConferencePaper
Citation:	Seyde, Tim, et al. "Learning to Plan via Deep Optimistic Value Exploration." Proceedings of Machine Learning Research, 120 (August 2020), 1-14.
Journal:	Proceedings of Machine Learning Research
License:	Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/
File Format:	application/pdf

Similar Items