Counterfactual off-policy evaluation with Gumbel-max structural causal models

Bibliographic Details
Main Authors: Oberst, Michael; Sontag, David Alexander
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Article
Language: English
Published: MLResearch Press 2021
Online Access: https://hdl.handle.net/1721.1/130437
collection MIT
description We introduce an off-policy evaluation procedure for highlighting episodes where applying a reinforcement learned (RL) policy is likely to have produced a substantially different outcome than the observed policy. In particular, we introduce a class of structural causal models (SCMs) for generating counterfactual trajectories in finite partially observable Markov Decision Processes (POMDPs). We see this as a useful procedure for off-policy "debugging" in high-risk settings (e.g., healthcare); by decomposing the expected difference in reward between the RL and observed policy into specific episodes, we can identify episodes where the counterfactual difference in reward is most dramatic. This in turn can be used to facilitate review of specific episodes by domain experts. We demonstrate the utility of this procedure with a synthetic environment of sepsis management.
id mit-1721.1/130437
institution Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/130437 2022-09-28T14:30:47Z
2021-04-09T20:45:12Z 2019-06 2021-04-06T18:37:22Z
Article http://purl.org/eprint/type/ConferencePaper
Oberst, Michael and David Sontag. "Counterfactual off-policy evaluation with gumbel-max structural causal models." Proceedings of the 36th International Conference on Machine Learning, June 2019, Long Beach, California, MLResearch Press, 2019.
en
http://proceedings.mlr.press/v97/
Proceedings of the 36th International Conference on Machine Learning
Article is made available in accordance with the publisher's policy and may be subject to US copyright law. Please refer to the publisher's site for terms of use.
application/pdf
Proceedings of Machine Learning Research