Building Blocks for Human-AI Alignment: Specify, Inspect, Model, and Revise

Bibliographic Details
Main Author: Booth, Serena Lynn
Other Authors: Shah, Julie A.
Format: Thesis
Published: Massachusetts Institute of Technology, 2024
Online Access: https://hdl.handle.net/1721.1/153862
Description
Summary: The learned behaviors of AI systems and robots should align with the intentions of their human designers. In service of this goal, people—especially experts—must be able to easily specify, inspect, model, and revise AI system and robot behaviors. These four interactions are critical building blocks for human-AI alignment. In this thesis, I study each of these problems. First, I study how experts write reward function specifications for reinforcement learning (RL). I find that experts write these specifications with respect to the RL algorithm rather than independently of it, and that they often write erroneous specifications that fail to encode their true intent, even in a trivial setting [22]. Second, I study how to support people in inspecting an agent’s learned behaviors. To do so, I introduce two related Bayesian inference methods for finding examples or environments that invoke particular system behaviors; viewing these examples and environments aids conceptual model formation and system debugging [25, 213]. Third, I study cognitive science theories that govern how people build conceptual models to explain these observed examples of agent behaviors. While some foundations of these theories are employed in typical interventions to support humans in learning about agent behaviors, I find there is significant room to build better curricula for interaction—for example, by showing counterexamples of alternative behaviors [24]. I conclude by speculating about how these building blocks of human-AI interaction can be combined to enable people to revise their specifications and, in doing so, create better-aligned agents.