Robustness of Reinforcement Learning Systems in Real-World Environments


Bibliographic Details
Main Author: Garau Luis, Juan José
Other Authors: Crawley, Edward F.
Format: Thesis
Published: Massachusetts Institute of Technology, 2023
Online Access: https://hdl.handle.net/1721.1/153087
Description
Summary: Reinforcement Learning (RL) is recognized as a promising paradigm for improving numerous real-world decision-making processes, potentially constituting the core of many future autonomous systems. However, despite its popularity across multiple fields, the number of proofs of concept in the literature is substantially larger than the number of reported deployments. This gap can be attributed primarily to differences between real-world environments and experimental RL setups. On one hand, from a domain-specific perspective, it is challenging to fully characterize concrete tasks and environments in the real world, and training in physical environments may not always be possible. On the other hand, the real world presents several domain-agnostic challenges that make learning more difficult, such as high dimensionality, non-stationarity, and limited generalizability. Although RL agents have demonstrated effective performance in practical applications, their robustness to these real-world phenomena remains limited. As a step towards better RL deployability, this thesis investigates different aspects of RL system design, focusing on enhancing robustness in real-world environments. It comprises three main areas of research.

Firstly, to comprehensively characterize the problem of real-world robustness, I propose an RL roadmap. The roadmap identifies key factors that influence the interaction between an RL system and a real-world environment, and offers a structured approach to the overall problem. I further delve into one specific element of this roadmap, the state space, and present a set of mathematical bounds on the change in mutual information (MI) between state features and rewards during policy learning. By observing how MI evolves during learning, I demonstrate how to identify more effective feature sets, as shown through a practical use case, the Traffic Signal Control problem.
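The idea of tracking MI between state features and rewards can be illustrated with a simple histogram-based plug-in estimator over rollout samples. This is a minimal sketch, not the estimator or the analytical bounds developed in the thesis; the function names (`mutual_information`, `rank_features`) and the binning scheme are assumptions made for this example.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based plug-in estimate of I(X; Y) in nats for 1-D samples.

    Crude but self-contained; the thesis instead derives bounds on how
    I(feature; reward) changes during policy learning.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()               # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)     # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)     # marginal p(y)
    nz = pxy > 0                            # avoid log(0) on empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def rank_features(states, rewards):
    """Rank state features by estimated MI with the reward signal.

    `states` is an (N, d) array of observed feature vectors, `rewards`
    the corresponding (N,) rewards; returns feature indices from most
    to least informative, plus the raw MI scores.
    """
    scores = [mutual_information(states[:, j], rewards)
              for j in range(states.shape[1])]
    order = np.argsort(scores)[::-1]
    return order, scores
```

Re-estimating these scores at successive training checkpoints gives the kind of MI-over-time trace the thesis uses to compare candidate feature sets.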
Secondly, I introduce MetaPG, a novel domain-agnostic RL design method that prioritizes robustness in addition to performance. MetaPG is an AutoRL method that automates the design of new actor-critic loss functions, represented as computational graphs, for optimizing multiple independent objectives. Through evolutionary search, MetaPG generates Pareto fronts of new algorithms that maximize and trade off all the objectives considered. When applied to a use case aimed at optimizing single-task performance, zero-shot generalizability, and stability across five different environments, the evolved algorithms show average improvements of 4.2%, 13.4%, and 67% in these metrics, respectively, compared to the SAC algorithm used as a warm start. Furthermore, MetaPG offers insights into the structure of the evolved algorithms, allowing for a better understanding of their functionality.

Lastly, this thesis focuses on applying conceptual frameworks and design principles to specific real-world problems in which robustness has been systematically overlooked. I introduce a novel RL system for solving the frequency assignment problem for multibeam satellite constellations. By conducting a comprehensive search over six major design decisions, I identify a design variation that achieves a 99.8% success rate in 100-beam scenarios. This variation, however, falls short in handling high dimensionality and non-stationarity; I demonstrate that robustness against these challenges can be obtained through different design variations, which attain an 87.3% success rate in 2,000-beam cases. I additionally investigate design trade-offs in another real-world application, molecular optimization, and show that current methods are not well aligned with robustness.
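The Pareto-front step at the heart of such a multi-objective search can be sketched as a non-dominated filter over candidates' objective scores. The function below is an illustrative sketch under assumed conventions (objectives to maximize, one score tuple per candidate); MetaPG's actual evolutionary loop over loss-function graphs is far more involved, and the name `pareto_front` is hypothetical.

```python
def pareto_front(population):
    """Return indices of non-dominated candidates.

    Each candidate is a tuple of objective scores to maximize, e.g.
    (performance, zero-shot generalizability, stability). Candidate `a`
    is dominated if some other candidate `b` scores >= on every
    objective and strictly > on at least one.
    """
    front = []
    for i, a in enumerate(population):
        dominated = any(
            all(bj >= aj for aj, bj in zip(a, b))
            and any(bj > aj for aj, bj in zip(a, b))
            for j, b in enumerate(population) if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

Keeping only the front after each generation preserves every trade-off between the objectives instead of collapsing them into a single scalar fitness.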