Computationally Efficient Reinforcement Learning under Partial Observability

A key challenge in reinforcement learning is the inability of the agent to fully observe the latent state of the system. Partially observable Markov decision processes (POMDPs) are a generalization of Markov decision processes (MDPs) that model this challenge. Unfortunately, planning and learning near-optimal policies in POMDPs is computationally intractable. Most existing algorithms either lack provable guarantees, require exponential time, or apply only under stringent assumptions about either the dynamics of the system or the observation model. This thesis shows that the computational intractability of planning and learning in worst-case POMDPs is fundamentally due to degeneracy in the observation model: making an appropriate assumption about the informativeness of the partial observations of the latent state mitigates the intractability. Specifically, we show that planning and learning are both possible in quasi-polynomial time for gamma-observable POMDPs, where gamma-observability is the assumption that c-well-separated distributions over the latent states induce (gamma·c)-well-separated distributions over observations. These are the first sub-exponential-time planning and learning algorithms for POMDPs under reasonable assumptions. While falling just short of polynomial time, quasi-polynomial time turns out to be optimal for gamma-observable POMDPs under standard complexity assumptions. The main technical innovation driving our algorithmic results is a new quantitative connection between gamma-observability and the stability of posterior distributions over the latent state in Hidden Markov Models (HMMs) and, more generally, POMDPs. Essentially, stability implies that old observations have limited relevance to the current state, and hence "short-memory" policies that depend only on a short window of recent observations are nearly optimal. This connection has several applications beyond planning and learning in POMDPs. Leveraging gamma-observability, we give a quasi-polynomial-time algorithm for (improperly) learning overcomplete HMMs that does not require a full-rankness assumption on the transition matrices. We also give a quasi-polynomial-time algorithm for computing coarse correlated equilibria in partially observable Markov games.
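
For context, the separation condition named in the abstract is often stated as a guarantee that the observation process cannot collapse the distance between belief states by more than a factor gamma. The display below is a hedged sketch of that condition; the matrix notation and symbols are illustrative and are not quoted from the thesis.

% Sketch of gamma-observability (illustrative notation, not the thesis's own):
% O is the |observations| x |latent states| matrix with entries O(o, s) = Pr[observe o | latent state s].
\[
  \lVert \mathbb{O}\, b - \mathbb{O}\, b' \rVert_1 \;\ge\; \gamma \, \lVert b - b' \rVert_1
  \qquad \text{for all belief distributions } b,\, b' \text{ over the latent states.}
\]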

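The abstract's short-memory policies depend only on a bounded window of recent observations. The following minimal Python sketch illustrates that idea under stated assumptions; the class name, the window length, and the lookup function are hypothetical and do not come from the thesis.

from collections import deque

class ShortMemoryPolicy:
    """Illustrative sketch: a policy that conditions only on the last few observations."""

    def __init__(self, window_length, action_for_window):
        # action_for_window is any map from a tuple of recent observations to an action;
        # window_length bounds how far back the policy is allowed to look.
        self.recent = deque(maxlen=window_length)
        self.action_for_window = action_for_window

    def observe(self, observation):
        # Record the newest observation; anything older than the window is forgotten.
        self.recent.append(observation)

    def act(self):
        # The chosen action depends only on the short window of recent observations.
        return self.action_for_window(tuple(self.recent))

A policy of this form is "short-memory" in the sense used above: two histories that agree on the last window_length observations receive the same action.
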
Bibliographic Details
Main Author: Rohatgi, Dhruv
Other Authors: Moitra, Ankur
Format: Thesis
Published: Massachusetts Institute of Technology 2023
Online Access: https://hdl.handle.net/1721.1/150191
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: S.M.
Date Issued: 2023-02
Rights: In Copyright - Educational Use Permitted; Copyright MIT; http://rightsstatements.org/page/InC-EDU/1.0/
File Format: application/pdf