Achieving Robustness and Generalization in MARL for Sequential Social Dilemmas through Bilinear Value Networks
This thesis presents a novel approach for training multi-agent reinforcement learning (MARL) agents that are robust to unforeseen gameplay strategies in sequential social dilemma (SSD) games. Recent literature has demonstrated that reward shaping can not only enable MARL agents to discover diverse, human-interpretable strategies with emergent qualities, but also help alleviate the tendency of conventional actor-critic methods to converge to suboptimal Nash equilibria in SSD games. However, agents trained through self-play typically converge to, and overfit on, a single Nash equilibrium. Consequently, these agents are limited to executing the specific strategy they converged to during training, which renders them ineffective against opponents employing commonly used strategies such as tit-for-tat. This thesis proposes a method that employs a bilinear value critic to learn an adaptive and robust strategy in SSD games through self-play with randomized reward sharing. We evaluate the efficacy of this approach on "prisoner's buddy," an iterated three-player variant of the prisoner's dilemma. Our results show that the bilinear value structure helps the critic generalize over the reward-sharing manifold and leads to an adaptive agent with emergent qualities such as reputation. These results highlight the ability of MARL agents trained through self-play to learn a general high-level policy that can effectively socialize with agents employing different strategies in SSD games. The proposed method is scalable and has the potential to be applied to a wide range of cooperative-competitive multi-agent environments, providing insights into the design of MARL algorithms for solving social dilemmas.
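To make the idea of a bilinear value critic concrete, the sketch below shows one minimal way such a critic could be structured: state features and reward-sharing features are encoded separately and combined through a bilinear form, so the value estimate varies smoothly with the (randomized) reward-sharing weights. All class names, dimensions, and the use of PyTorch's nn.Bilinear are illustrative assumptions for this sketch, not the implementation described in the thesis.

```python
# Hypothetical sketch of a bilinear value critic conditioned on a
# reward-sharing vector (illustrative only; not the thesis's code).
import torch
import torch.nn as nn


class BilinearValueCritic(nn.Module):
    def __init__(self, obs_dim: int, share_dim: int, hidden: int = 64):
        super().__init__()
        # Separate encoders for the observation and the reward-sharing weights.
        self.obs_encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.share_encoder = nn.Sequential(
            nn.Linear(share_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Bilinear interaction: V(s, w) = f(s)^T W g(w) + b.
        self.bilinear = nn.Bilinear(hidden, hidden, 1)

    def forward(self, obs: torch.Tensor, share: torch.Tensor) -> torch.Tensor:
        f_s = self.obs_encoder(obs)      # state features
        g_w = self.share_encoder(share)  # reward-sharing features
        return self.bilinear(f_s, g_w).squeeze(-1)


# During self-play training, the reward-sharing vector could be resampled
# each episode so the critic must generalize over the sharing manifold.
critic = BilinearValueCritic(obs_dim=12, share_dim=3)
obs = torch.randn(8, 12)                          # batch of observations
share = torch.softmax(torch.randn(8, 3), dim=-1)  # randomized sharing weights
values = critic(obs, share)                       # shape: (8,)
```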
Main Author: | Ma, Jeremy |
---|---|
Other Authors: | How, Jonathan P. |
Department: | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Degree: | M.Eng. |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2023 |
Rights: | In Copyright - Educational Use Permitted; copyright retained by author(s) (https://rightsstatements.org/page/InC-EDU/1.0/) |
Online Access: | https://hdl.handle.net/1721.1/152745 |