QSOD: Hybrid Policy Gradient for Deep Multi-agent Reinforcement Learning


Bibliographic Details
Main Authors: Hafiz Muhammad Raza Ur Rehman, Byung-Won On, Devarani Devi Ningombam, Sungwon Yi, Gyu Sang Choi
Format: Article
Language: English
Published: IEEE, 2021-01-01
Series: IEEE Access
Subjects: Artificial intelligence; multiagent systems; optimization
Online Access: https://ieeexplore.ieee.org/document/9540595/
Collection: DOAJ
Description: When individuals interact with one another to accomplish specific goals, they learn from others’ experiences to achieve the tasks at hand. The same holds for learning in virtual environments, such as video games. Deep multiagent reinforcement learning shows promising results on many challenging tasks. Most algorithms demonstrate this by using value decomposition for multiple agents: to guide each agent’s behavior, the joint Q-value of the agents is decomposed into individual agent Q-values. Different mixing methods built on a monotonicity assumption, such as the value decomposition algorithms QMIX and QVMix, can be used for this. However, these methods select each agent’s actions through a greedy policy and do not address the large number of training trials the agents require. In this paper, we propose a novel hybrid policy for individual agent action selection, Q-value Selection using Optimization and DRL (QSOD), in which a grey wolf optimizer (GWO) determines the agents’ action choices. As in GWO, the agents attend to one another through mutual coordination. We used the StarCraft 2 Learning Environment to compare our proposed algorithm with the state-of-the-art algorithms QMIX and QVMix. Experimental results demonstrate that our algorithm outperforms QMIX and QVMix in all scenarios and requires fewer training trials.
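The grey wolf optimizer referenced in the abstract is a population-based metaheuristic in which candidate solutions ("wolves") move toward the three current best solutions (alpha, beta, delta) under a decaying exploration coefficient. The following is a minimal sketch of the generic GWO update only, not the authors' QSOD integration with Q-values; the function `gwo_minimize` and the sphere objective are our own illustration:

```python
import numpy as np

def gwo_minimize(f, dim, bounds, n_wolves=20, n_iters=200, seed=0):
    """Minimize f over a box using the grey wolf optimizer (GWO).

    Each wolf is pulled toward the three best solutions found so far
    (alpha, beta, delta); the coefficient `a` decays from 2 to 0,
    shifting the pack from exploration to exploitation.
    """
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(n_wolves, dim))  # pack positions
    fit = np.array([f(x) for x in X])

    for t in range(n_iters):
        order = np.argsort(fit)
        alpha, beta, delta = X[order[0]], X[order[1]], X[order[2]]
        a = 2.0 * (1 - t / n_iters)                # linearly decaying coefficient
        for i in range(n_wolves):
            new = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2 * a * r1 - a                 # large |A|: search; small: attack
                C = 2 * r2
                D = np.abs(C * leader - X[i])      # perturbed distance to the leader
                new += leader - A * D
            X[i] = np.clip(new / 3.0, lo, hi)      # average of the three pulls
            fit[i] = f(X[i])

    best = int(np.argmin(fit))
    return X[best], float(fit[best])

# Example: minimize the sphere function, whose optimum is 0 at the origin.
x_best, f_best = gwo_minimize(lambda x: float(np.sum(x * x)), dim=2, bounds=(-5, 5))
```

In QSOD the quantity being optimized is tied to per-agent action selection rather than a continuous test function, so this sketch only shows the coordination mechanism the abstract alludes to: every wolf's update is a consensus of the three leaders, which is how the pack "attends" to its best members.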
ISSN: 2169-3536
Author Affiliations:
Hafiz Muhammad Raza Ur Rehman (ORCID: 0000-0003-2230-6927), Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, South Korea
Byung-Won On (ORCID: 0000-0001-6929-3188), Department of Software Convergence Engineering, Kunsan National University, Gunsan, South Korea
Devarani Devi Ningombam, Planning Division, Electronics and Telecommunications Research Institute, Daejeon, South Korea
Sungwon Yi, Planning Division, Electronics and Telecommunications Research Institute, Daejeon, South Korea
Gyu Sang Choi (ORCID: 0000-0002-0854-768X), Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, South Korea
Citation: IEEE Access, vol. 9, pp. 129728-129741, 2021. DOI: 10.1109/ACCESS.2021.3113350
Subjects: Artificial intelligence; multiagent systems; optimization