Computationally Efficient Reinforcement Learning under Partial Observability

A key challenge in reinforcement learning is the inability of the agent to fully observe the latent state of the system. Partially observable Markov decision processes (POMDPs) are a generalization of Markov decision processes (MDPs) that model this challenge. Unfortunately, planning and learning near-optimal policies in POMDPs is computationally intractable. Most existing algorithms either lack provable guarantees, require exponential time, or apply only under stringent assumptions about either the dynamics of the system or the observation model. This thesis shows that the computational intractability of planning and learning in worst-case POMDPs is fundamentally due to degeneracy in the observation model: making an appropriate assumption about the informativeness of the partial observations of the latent state mitigates the intractability. Specifically, we show that planning and learning are both possible in quasi-polynomial time for gamma-observable POMDPs, where gamma-observability is the assumption that c-well-separated distributions over the latent states induce (gamma·c)-well-separated distributions over observations. These are the first sub-exponential-time planning and learning algorithms for POMDPs under reasonable assumptions. While falling just short of polynomial time, quasi-polynomial time turns out to be optimal for gamma-observable POMDPs under standard complexity assumptions. The main technical innovation driving our algorithmic results is a new quantitative connection between gamma-observability and the stability of posterior distributions over the latent state in Hidden Markov Models (HMMs) and, more generally, POMDPs. Essentially, stability implies that old observations have limited relevance to the current state, and hence "short-memory" policies that depend only on a short window of recent observations are nearly optimal. This connection has several applications beyond planning and learning in POMDPs. Leveraging gamma-observability, we give a quasi-polynomial-time algorithm for (improperly) learning overcomplete HMMs that does not require a full-rankness assumption on the transition matrices. We also give a quasi-polynomial-time algorithm for computing coarse correlated equilibria in partially observable Markov games.
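
For context, the separation condition named in the abstract is often stated as a guarantee that the observation process cannot collapse the distance between belief states by more than a factor gamma. The display below is a hedged sketch of that condition; the matrix notation and symbols are illustrative and are not quoted from the thesis.

% Sketch of gamma-observability (illustrative notation, not the thesis's own):
% O is the |observations| x |latent states| matrix with entries O(o, s) = Pr[observe o | latent state s].
\[
  \lVert \mathbb{O}\, b - \mathbb{O}\, b' \rVert_1 \;\ge\; \gamma \, \lVert b - b' \rVert_1
  \qquad \text{for all belief distributions } b,\, b' \text{ over the latent states.}
\]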

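The abstract's short-memory policies depend only on a bounded window of recent observations. The following minimal Python sketch illustrates that idea under stated assumptions; the class name, the window length, and the lookup function are hypothetical and do not come from the thesis.

from collections import deque

class ShortMemoryPolicy:
    """Illustrative sketch: a policy that conditions only on the last few observations."""

    def __init__(self, window_length, action_for_window):
        # action_for_window is any map from a tuple of recent observations to an action;
        # window_length bounds how far back the policy is allowed to look.
        self.recent = deque(maxlen=window_length)
        self.action_for_window = action_for_window

    def observe(self, observation):
        # Record the newest observation; anything older than the window is forgotten.
        self.recent.append(observation)

    def act(self):
        # The chosen action depends only on the short window of recent observations.
        return self.action_for_window(tuple(self.recent))

A policy of this form is "short-memory" in the sense used above: two histories that agree on the last window_length observations receive the same action.
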
Bibliographic Details
Main Author: Rohatgi, Dhruv
Other Authors: Moitra, Ankur
Format: Thesis
Published: Massachusetts Institute of Technology 2023
Online Access: https://hdl.handle.net/1721.1/150191
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: S.M.
Date Issued: 2023-02
Rights: In Copyright - Educational Use Permitted; Copyright MIT; http://rightsstatements.org/page/InC-EDU/1.0/
File Format: application/pdf