Computationally Efficient Reinforcement Learning under Partial Observability

Bibliographic Details
Main Author: Rohatgi, Dhruv
Other Authors: Moitra, Ankur
Format: Thesis
Published: Massachusetts Institute of Technology, 2023
Online Access: https://hdl.handle.net/1721.1/150191
Description
Summary: A key challenge in reinforcement learning is the inability of the agent to fully observe the latent state of the system. Partially observable Markov decision processes (POMDPs) are a generalization of Markov decision processes (MDPs) that model this challenge. Unfortunately, planning and learning near-optimal policies in POMDPs is computationally intractable. Most existing algorithms either lack provable guarantees, require exponential time, or apply only under stringent assumptions about either the dynamics of the system or the observation model.

This thesis shows that the computational intractability of planning and learning in worst-case POMDPs is fundamentally due to degeneracy in the observation model: an appropriate assumption about the informativeness of the partial observations of the latent state mitigates the intractability. Specifically, we show that planning and learning are both possible in quasi-polynomial time for gamma-observable POMDPs, where gamma-observability is the assumption that any two distributions over the latent states that are c-separated (in total variation distance) induce distributions over observations that are at least (gamma*c)-separated. These are the first sub-exponential time planning and learning algorithms for POMDPs under reasonable assumptions. Although quasi-polynomial time falls just short of polynomial time, it turns out to be optimal for gamma-observable POMDPs under standard complexity assumptions.

The main technical innovation driving our algorithmic results is a new quantitative connection between gamma-observability and the stability of posterior distributions over the latent state in Hidden Markov Models (HMMs) and, more generally, POMDPs. Essentially, stability implies that old observations have limited relevance to the current state, and hence "short-memory" policies that depend only on a short window of recent observations are nearly optimal.

This connection has several applications beyond planning and learning in POMDPs. Leveraging gamma-observability, we give a quasi-polynomial time algorithm for (improperly) learning overcomplete HMMs that does not require a full-rank assumption on the transition matrices. We also give a quasi-polynomial time algorithm for computing coarse correlated equilibria in partially observable Markov games.
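The two central notions in the abstract can be made concrete with a short numerical sketch. The Python snippet below is illustrative only and is not code from the thesis; the matrices T (with T[s, s'] = Pr(s' | s)) and O (with O[s, o] = Pr(o | s)) and all function names are hypothetical. It checks the gamma-observability contraction condition for one pair of beliefs, and computes a belief from a short window of recent observations by running a Bayes filter from a uniform prior, the kind of "short-memory" approximation the abstract describes as nearly optimal under gamma-observability.

import numpy as np

def observability_ratio(O, b1, b2):
    """Contraction ratio ||O^T (b1 - b2)||_1 / ||b1 - b2||_1 for one pair of
    beliefs; gamma-observability requires this ratio to be at least gamma for
    every pair of distributions over the latent states."""
    diff = b1 - b2
    denom = np.abs(diff).sum()
    if denom == 0.0:
        return np.inf  # identical beliefs impose no constraint
    return np.abs(O.T @ diff).sum() / denom

def short_memory_belief(T, O, window):
    """Approximate posterior over latent states computed from only the
    observations in `window` (oldest first), starting from a uniform prior."""
    n_states = T.shape[0]
    belief = np.full(n_states, 1.0 / n_states)  # uninformative prior
    for obs in window:
        belief = belief @ T              # predict: push belief through the dynamics
        belief = belief * O[:, obs]      # update: reweight by observation likelihood
        belief = belief / belief.sum()   # renormalize to a probability vector
    return belief

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 3-state, 4-observation HMM; each row is a probability distribution.
    T = rng.dirichlet(np.ones(3), size=3)
    O = rng.dirichlet(np.ones(4), size=3)
    b1, b2 = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
    print("contraction ratio for this pair of beliefs:", observability_ratio(O, b1, b2))
    print("belief from a 5-observation window:", short_memory_belief(T, O, [0, 2, 1, 3, 0]))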