Optimal Reinforcement Learning with Black Holes

Bibliographic Details
Main Author: Micali, Enrico
Other Authors: Daskalakis, Constantinos
Format: Thesis
Published: Massachusetts Institute of Technology, 2022
Online Access: https://hdl.handle.net/1721.1/143151
Description
Summary: We introduce the Black Hole Reinforcement Learning problem, a previously unexplored variant of reinforcement learning in which we lose all turn information and all reward from trajectories that visit a particular subset of states. We assume awareness of the trajectory loss events, making this a censored-data problem rather than a truncated-data problem. We work in a fixed-horizon setting to keep our arguments readable in what we hope will be a seminal paper in a new branch of “biased” data science. We begin by proposing a series of three MDP-inspired models for the Black Hole RL setting, each striking a unique compromise between the compactness of the state space and the optimality of a memoryless policy over the states. For the two models that are memoryless in both the state transition and turn reward functions, we describe a stochastic policy gradient algorithm that guarantees non-asymptotic convergence to a policy with finitely suboptimal expected trajectory reward, focusing on the more compact model, which benefits from better convergence rates. This algorithm’s guarantee hinges on the assumption that every location in the problem environment has a non-zero chance of being an agent’s start state. We dedicate the remainder of the paper to the feasibility of finding optimal policies for Black Hole RL problems in which this assumption does not hold. We prove that converging to an optimal policy is feasible under the alternative assumption that the distribution determining agents’ starting states is known. Black Hole RL problems in which the starting state distribution is “sparse” and unknown remain open. We provide a potential first avenue of research into their feasibility, which we believe to be the next natural frontier in Black Hole Reinforcement Learning Theory.
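
The thesis does not include code, but the observation model described in the summary is easy to sketch. Purely as an illustration, the following toy Python environment (the chain dynamics, rewards, horizon, and every name in it are hypothetical choices of ours, not the thesis's) shows what a learner sees in the Black Hole setting: a full turn-by-turn record for trajectories that avoid the black-hole states, and nothing but a censoring flag for trajectories that enter them.

```python
# Illustrative only: a toy fixed-horizon chain MDP with a black-hole state.
# The environment, its rewards, the horizon, and all names below are our own
# hypothetical choices; the thesis does not specify an environment or code.
import random

N_STATES = 6          # states 0..5 on a chain; state 3 is the black hole
BLACK_HOLE = {3}      # the "particular subset of states" from the summary
HORIZON = 5           # fixed-horizon setting, as in the thesis
ACTIONS = (-1, +1)    # move left / move right along the chain

def step(state, action):
    """Deterministic chain dynamics with a small reward for moving right."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if action == +1 else 0.0
    return next_state, reward

def rollout(policy, start_state):
    """Run one episode and return only what the learner observes.

    A trajectory that visits the black-hole set loses every turn record and
    all reward; only the loss event itself is reported. That makes the data
    censored rather than truncated: we still know the episode existed.
    """
    state, turns, total_reward = start_state, [], 0.0
    for _ in range(HORIZON):
        if state in BLACK_HOLE:
            return {"censored": True}
        action = policy(state)
        next_state, reward = step(state, action)
        turns.append((state, action, reward))
        total_reward += reward
        state = next_state
    if state in BLACK_HOLE:
        return {"censored": True}
    return {"censored": False, "turns": turns, "trajectory_reward": total_reward}

if __name__ == "__main__":
    random.seed(0)
    uniform_policy = lambda s: random.choice(ACTIONS)
    # Drawing start states uniformly gives every state a non-zero start
    # probability -- the assumption behind the policy-gradient guarantee.
    observations = [rollout(uniform_policy, random.randrange(N_STATES))
                    for _ in range(5)]
    for obs in observations:
        print(obs)
```

Because the loss events remain visible, the resulting data are censored rather than truncated, and drawing the start state uniformly, as in this sketch, is one way to satisfy the non-zero start-probability assumption on which the thesis's policy-gradient convergence guarantee rests.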