Learning with opponent-learning awareness

Bibliographic Details
Main Authors: Foerster, J, Chen, R, Al-Shedivat, M, Whiteson, S, Abbeel, P, Mordatch, I
Format: Conference item
Published in: International Foundation for Autonomous Agents and Multiagent Systems, 2018
author Foerster, J
Chen, R
Al-Shedivat, M
Whiteson, S
Abbeel, P
Mordatch, I
collection OXFORD
description Multi-agent settings are quickly gathering importance in machine learning. This includes a plethora of recent work on deep multi-agent reinforcement learning, but also extends to hierarchical reinforcement learning, generative adversarial networks and decentralised optimization. In all these settings the presence of multiple learning agents renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment. The LOLA learning rule includes an additional term that accounts for the impact of one agent’s policy on the anticipated parameter update of the other agents. Preliminary results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat and therefore cooperation in the iterated prisoners’ dilemma (IPD), while independent learning does not. In this domain, LOLA also receives higher payouts than a naive learner and is robust against exploitation by higher-order gradient-based methods. Applied to infinitely repeated matching pennies, LOLA agents converge to the Nash equilibrium. In a round-robin tournament we show that LOLA agents can successfully shape the learning of a range of multi-agent learning algorithms from the literature, resulting in the highest average returns on the IPD. We also show that the LOLA update rule can be efficiently calculated using an extension of the likelihood ratio policy gradient estimator, making the method suitable for model-free reinforcement learning. This method thus scales to large parameter and input spaces and nonlinear function approximators. We also apply LOLA to a grid-world task with an embedded social dilemma, using deep recurrent policies and opponent modelling. Again, by explicitly considering the learning of the other agent, LOLA agents learn to cooperate out of self-interest.
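To make the LOLA rule described above concrete, here is a minimal sketch of an exact-gradient version of the update for agent 1, assuming two differentiable value functions V1(th1, th2) and V2(th1, th2) are available; the JAX implementation, function names, and step sizes lr and eta are illustrative assumptions, not taken from the paper. The additional term arises from differentiating through the opponent's anticipated naive-learner step, which recovers the second-order shaping correction to first order.

    import jax

    def lola_update_agent1(V1, V2, th1, th2, lr=0.1, eta=0.1):
        # V1, V2: differentiable maps (th1, th2) -> expected return of agents 1 and 2.
        def shaped_value(th1_):
            # Opponent's anticipated naive update; its dependence on th1_ stays in the
            # graph, so differentiating below yields the opponent-shaping correction,
            # roughly grad_th1 V1 + eta * (d2 V2 / dth1 dth2)^T grad_th2 V1.
            opp_step = eta * jax.grad(V2, argnums=1)(th1_, th2)
            return V1(th1_, th2 + opp_step)
        return th1 + lr * jax.grad(shaped_value)(th1)

In settings where only sampled trajectories are available, the abstract notes that the same correction can instead be estimated with an extension of the likelihood ratio policy gradient estimator rather than exact gradients.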
format Conference item
id oxford-uuid:775b8bcc-cf2c-488f-9db5-eeafb2aad0c8
institution University of Oxford
publishDate 2018
publisher International Foundation for Autonomous Agents and Multiagent Systems
record_format dspace
title Learning with opponent-learning awareness