Learning with opponent-learning awareness

Bibliographic Details
Main Authors: Foerster, J, Chen, R, Al-Shedivat, M, Whiteson, S, Abbeel, P, Mordatch, I
Format: Conference item
Published in: International Foundation for Autonomous Agents and Multiagent Systems, 2018
author Foerster, J
Chen, R
Al-Shedivat, M
Whiteson, S
Abbeel, P
Mordatch, I
collection OXFORD
description Multi-agent settings are quickly gathering importance in machine learning. This includes a plethora of recent work on deep multi-agent reinforcement learning, but also extends to hierarchical reinforcement learning, generative adversarial networks and decentralised optimization. In all these settings the presence of multiple learning agents renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment. The LOLA learning rule includes an additional term that accounts for the impact of one agent’s policy on the anticipated parameter update of the other agents. Preliminary results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat and therefore cooperation in the iterated prisoners’ dilemma (IPD), while independent learning does not. In this domain, LOLA also receives higher payouts than a naive learner and is robust against exploitation by higher-order gradient-based methods. Applied to infinitely repeated matching pennies, LOLA agents converge to the Nash equilibrium. In a round-robin tournament we show that LOLA agents can successfully shape the learning of a range of multi-agent learning algorithms from the literature, resulting in the highest average returns on the IPD. We also show that the LOLA update rule can be efficiently calculated using an extension of the likelihood ratio policy gradient estimator, making the method suitable for model-free reinforcement learning. This method thus scales to large parameter and input spaces and nonlinear function approximators. We also apply LOLA to a grid-world task with an embedded social dilemma, using deep recurrent policies and opponent modelling. Again, by explicitly considering the learning of the other agent, LOLA agents learn to cooperate out of self-interest.
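To make the LOLA rule described above concrete, here is a minimal sketch of an exact-gradient version of the update for agent 1, assuming two differentiable value functions V1(th1, th2) and V2(th1, th2) are available; the JAX implementation, function names, and step sizes lr and eta are illustrative assumptions, not taken from the paper. The additional term arises from differentiating through the opponent's anticipated naive-learner step, which recovers the second-order shaping correction to first order.

    import jax

    def lola_update_agent1(V1, V2, th1, th2, lr=0.1, eta=0.1):
        # V1, V2: differentiable maps (th1, th2) -> expected return of agents 1 and 2.
        def shaped_value(th1_):
            # Opponent's anticipated naive update; its dependence on th1_ stays in the
            # graph, so differentiating below yields the opponent-shaping correction,
            # roughly grad_th1 V1 + eta * (d2 V2 / dth1 dth2)^T grad_th2 V1.
            opp_step = eta * jax.grad(V2, argnums=1)(th1_, th2)
            return V1(th1_, th2 + opp_step)
        return th1 + lr * jax.grad(shaped_value)(th1)

In settings where only sampled trajectories are available, the abstract notes that the same correction can instead be estimated with an extension of the likelihood ratio policy gradient estimator rather than exact gradients.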
format Conference item
id oxford-uuid:775b8bcc-cf2c-488f-9db5-eeafb2aad0c8
institution University of Oxford
publishDate 2018
publisher International Foundation for Autonomous Agents and Multiagent Systems
record_format dspace
title Learning with opponent-learning awareness