Achieving Robustness and Generalization in MARL for Sequential Social Dilemmas through Bilinear Value Networks

This thesis presents a novel approach for training multi-agent reinforcement learning (MARL) agents that are robust to different unforeseen gameplay strategies in sequential social dilemma (SSD) games. Recent literature has demonstrated that reward shaping can not only be used to enable MARL agents...

Bibliographic Details
Main Author: Ma, Jeremy
Other Authors: How, Jonathan P.
Format: Thesis
Published: Massachusetts Institute of Technology 2023
Online Access: https://hdl.handle.net/1721.1/152745
collection MIT
description This thesis presents a novel approach for training multi-agent reinforcement learning (MARL) agents that are robust to unforeseen gameplay strategies in sequential social dilemma (SSD) games. Recent literature has demonstrated that reward shaping can not only enable MARL agents to discover diverse, human-interpretable strategies with emergent qualities, but also help alleviate the tendency of conventional actor-critic methods to converge to suboptimal Nash equilibria in SSD games. However, agents trained through self-play typically converge and overfit to a single Nash equilibrium. Consequently, these agents are limited to executing the specific strategy they converged to during training, which renders them ineffective against opponents employing commonly used strategies such as tit-for-tat. This thesis proposes a method that employs a bilinear value critic that can learn an adaptive and robust strategy in SSD games through self-play with randomized reward sharing. We evaluate the efficacy of this approach on “prisoner’s buddy,” an iterated three-player variant of the prisoner’s dilemma game. Our results show that the bilinear value structure helps the critic generalize over the reward sharing manifold and leads to an adaptive agent with emergent qualities such as reputation. These results highlight the ability of MARL agents to learn a general high-level policy that can effectively socialize with agents using different strategies in SSD games, despite being trained through self-play. The proposed method is scalable and has the potential to be applied to a wide range of multi-agent competitive-cooperative environments, providing insights into the design of MARL algorithms for solving social dilemmas.
first_indexed 2024-09-23T09:49:26Z
format Thesis
id mit-1721.1/152745
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T09:49:26Z
publishDate 2023
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/152745 2023-11-03T03:54:33Z Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science M.Eng. 2023-11-02T20:12:45Z 2023-11-02T20:12:45Z 2023-09 2023-10-03T18:21:28.799Z Thesis https://hdl.handle.net/1721.1/152745 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
title Achieving Robustness and Generalization in MARL for Sequential Social Dilemmas through Bilinear Value Networks