Offline Reward Learning from Human Demonstrations and Feedback: A Linear Programming Approach

In many complex sequential decision-making tasks, no explicit reward function is known, and the only available information is human demonstrations and feedback data. To infer and shape the underlying reward function from this data, two key methodologies have emerged: inverse reinforcement learning (IRL) and reinforcement learning from human feedback (RLHF). Despite the successful application of these reward learning techniques across a wide range of tasks, a significant gap between theory and practice persists. This work aims to bridge that gap by introducing a novel linear programming (LP) framework tailored to offline IRL and RLHF. Most previous work in reward learning has employed the maximum likelihood estimation (MLE) approach, which relies on prior knowledge of, or assumptions about, decision or preference models. Such dependencies can lead to robustness issues, particularly when there is a mismatch between the presupposed models and actual human behavior. In response to these challenges, recent research has shifted toward recovering a feasible reward set: the set of reward functions under which the expert policy is optimal. In line with this perspective, we focus on estimating the feasible reward set in an offline setting. Using pre-collected trajectories without online exploration, our framework estimates a feasible reward set from the primal-dual optimality conditions of a suitably designed LP and offers an optimality guarantee with provable sample efficiency. One notable feature of the LP framework is the convexity of the resulting solution set, which facilitates aligning reward functions with human feedback, such as pairwise trajectory comparison data, while maintaining computational tractability and sample efficiency. Through analytical examples and numerical experiments, we demonstrate that our framework has the potential to outperform the conventional MLE approach.
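
For readers unfamiliar with the LP view of reward learning, the following sketch records the standard occupancy-measure LP for a discounted MDP (S, A, P, \gamma, \rho) and the resulting characterization of a feasible reward set. The notation and the (1 - \gamma) normalization are generic illustrative choices, not necessarily the exact formulation used in the thesis.

\[
\max_{\mu \ge 0} \; \sum_{s,a} \mu(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} \mu(s',a) = (1-\gamma)\,\rho(s') + \gamma \sum_{s,a} P(s' \mid s,a)\, \mu(s,a) \quad \forall s',
\]
\[
\min_{V} \; (1-\gamma) \sum_{s} \rho(s)\, V(s)
\quad \text{s.t.} \quad
V(s) \ge r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \quad \forall (s,a).
\]

In this language, a reward r is feasible for an expert occupancy measure \mu^E if some dual-feasible V satisfies complementary slackness on the expert's support, i.e., V(s) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) V(s') whenever \mu^E(s,a) > 0. Both conditions are linear in (r, V), which is the source of the convexity emphasized in the abstract.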

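The convexity claim also has a direct computational reading: pairwise trajectory comparisons impose linear constraints on r, so choosing a reward from an (estimated) feasible set that respects the comparisons is itself a linear program. The snippet below is a minimal tabular sketch of that idea using scipy.optimize.linprog; the toy MDP, the assumption that the expert always plays action 0, and the random "trajectory features" are placeholders rather than anything taken from the thesis.

# Minimal illustrative sketch (not the thesis's estimator): pick a reward vector that
# (i) keeps the demonstrated expert actions optimal via Bellman / complementary-slackness
# conditions that are linear in (r, V), and (ii) maximizes the margin of one pairwise
# trajectory preference. The toy MDP and expert policy below are assumptions.
import numpy as np
from scipy.optimize import linprog

nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))       # P[s, a] = next-state distribution

nR = nS * nA                                        # variables: x = [r, V, margin]
idx = lambda s, a: s * nA + a
A_ub, b_ub, A_eq, b_eq = [], [], [], []

for s in range(nS):
    for a in range(nA):
        # r(s, a) + gamma * P(s, a)^T V - V(s) <= 0, with equality at expert actions
        row = np.zeros(nR + nS + 1)
        row[idx(s, a)] = 1.0
        row[nR:nR + nS] += gamma * P[s, a]
        row[nR + s] -= 1.0
        if a == 0:                                  # assume the expert always plays a = 0
            A_eq.append(row); b_eq.append(0.0)
        else:
            A_ub.append(row); b_ub.append(0.0)

# One pairwise preference: phi_plus (preferred) vs. phi_minus, where a feature vector
# stands for a trajectory's discounted state-action visitation counts.
phi_plus, phi_minus = rng.random(nR), rng.random(nR)
row = np.zeros(nR + nS + 1)
row[:nR] = -(phi_plus - phi_minus)                  # margin <= (phi_plus - phi_minus)^T r
row[-1] = 1.0
A_ub.append(row); b_ub.append(0.0)

c = np.zeros(nR + nS + 1); c[-1] = -1.0             # maximize the preference margin
bounds = [(-1, 1)] * nR + [(None, None)] * nS + [(0, None)]
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=bounds)
print("LP solved:", res.success, "| preference margin:", -res.fun if res.success else None)

Roughly speaking, an offline estimator in this spirit would replace the exact Bellman-type constraints above with their sample-based counterparts built from pre-collected trajectories, while the optimization remains linear.
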
Bibliographic Details
Main Author: Kim, Kihyun
Other Authors: Ozdaglar, Asuman; Parrilo, Pablo A.
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Format: Thesis
Degree: S.M.
Published: Massachusetts Institute of Technology, 2024
Online Access: https://hdl.handle.net/1721.1/156337
Rights: In Copyright - Educational Use Permitted; copyright retained by author(s) (https://rightsstatements.org/page/InC-EDU/1.0/)