Contextual Bandit Learning With Reward Oracles and Sampling Guidance in Multi-Agent Environments
Learning an action policy for autonomous agents in a decentralized multi-agent environment remains an interesting but difficult research problem. We propose to model this problem as a contextual bandit with delayed reward signals, specifically an individual short-term reward signal and a shared long-term reward signal. Our algorithm uses reward oracles to model these delayed reward signals directly and relies on a learning scheme that benefits from the sampling guidance of an expert-designed policy. The algorithm is expected to apply to a wide range of problems, including those with constraints on accessing state transitions and those with implicit reward information. A demonstration, implemented with deep learning regressors, shows the effectiveness of the proposed algorithm in learning an offensive action policy in the RoboCup Soccer 2D Simulation (RCSS) environment against a well-known adversary benchmark team, compared with a baseline policy.
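For illustration only, below is a minimal sketch of the setting the abstract describes: a contextual bandit learner that fits two reward oracles (regressors predicting the individual short-term reward and the shared long-term reward from context-action pairs) and that draws part of its training samples from an expert-designed policy. All names and parameters here (`GuidedBanditLearner`, `guidance`, `mix`, the feature encoding, the fixed guidance probability) are assumptions made for the sketch, not the authors' implementation; the paper's demonstration uses deep learning regressors, whereas scikit-learn's `GradientBoostingRegressor` is substituted purely for brevity.

```python
# Illustrative sketch only: contextual bandit learning with two reward
# oracles and sampling guidance from an expert-designed policy.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


class GuidedBanditLearner:
    def __init__(self, n_actions, expert_policy, guidance=0.5, mix=0.5):
        self.n_actions = n_actions
        self.expert_policy = expert_policy      # expert(context) -> action index
        self.guidance = guidance                # prob. of sampling the expert action
        self.mix = mix                          # weight of short- vs. long-term estimate
        self.short_oracle = GradientBoostingRegressor()   # individual short-term reward
        self.long_oracle = GradientBoostingRegressor()    # shared long-term reward
        self._X, self._r_short, self._r_long = [], [], []
        self._fitted = False

    def _features(self, context, action):
        # Simple (context, one-hot action) feature encoding.
        one_hot = np.zeros(self.n_actions)
        one_hot[action] = 1.0
        return np.concatenate([context, one_hot])

    def select_action(self, context, rng):
        # Sampling guidance: defer to the expert policy part of the time,
        # otherwise act greedily on the combined oracle estimate.
        if not self._fitted or rng.random() < self.guidance:
            return self.expert_policy(context)
        feats = np.stack([self._features(context, a) for a in range(self.n_actions)])
        score = (self.mix * self.short_oracle.predict(feats)
                 + (1.0 - self.mix) * self.long_oracle.predict(feats))
        return int(np.argmax(score))

    def record(self, context, action, short_reward, long_reward):
        # Delayed rewards are attributed back to the (context, action) pair
        # once they become available.
        self._X.append(self._features(context, action))
        self._r_short.append(short_reward)
        self._r_long.append(long_reward)

    def refit(self):
        # Refit both reward oracles on all observed samples.
        X = np.stack(self._X)
        self.short_oracle.fit(X, np.array(self._r_short))
        self.long_oracle.fit(X, np.array(self._r_long))
        self._fitted = True
```

In use, one might create `rng = np.random.default_rng(0)`, alternate `select_action` and `record` while interacting with the environment, and call `refit` between episodes; annealing `guidance` toward zero over training would shift sampling from the expert-designed policy toward the learned oracles.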
Main Authors: | Mike Li, Quang Dang Nguyen
---|---
Format: | Article
Language: | English
Published: | IEEE, 2021-01-01
Series: | IEEE Access
Subjects: | Contextual bandit learning; reward oracles; sampling guidance from expert-designed policies; short-term and long-term rewards
Online Access: | https://ieeexplore.ieee.org/document/9474507/
_version_ | 1818612185158385664 |
---|---|
author | Mike Li; Quang Dang Nguyen |
collection | DOAJ |
description | Learning an action policy for autonomous agents in a decentralized multi-agent environment remains an interesting but difficult research problem. We propose to model this problem as a contextual bandit with delayed reward signals, specifically an individual short-term reward signal and a shared long-term reward signal. Our algorithm uses reward oracles to model these delayed reward signals directly and relies on a learning scheme that benefits from the sampling guidance of an expert-designed policy. The algorithm is expected to apply to a wide range of problems, including those with constraints on accessing state transitions and those with implicit reward information. A demonstration, implemented with deep learning regressors, shows the effectiveness of the proposed algorithm in learning an offensive action policy in the RoboCup Soccer 2D Simulation (RCSS) environment against a well-known adversary benchmark team, compared with a baseline policy. |
first_indexed | 2024-12-16T15:42:12Z |
format | Article |
id | doaj.art-fcb5205821ea46c889fa20c5bc1b4e47 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-16T15:42:12Z |
publishDate | 2021-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | Mike Li (ORCID: https://orcid.org/0000-0003-4514-7260) and Quang Dang Nguyen (ORCID: https://orcid.org/0000-0002-0403-6903), both at the Centre for Complex Systems, Faculty of Engineering, University of Sydney, Sydney, NSW, Australia. "Contextual Bandit Learning With Reward Oracles and Sampling Guidance in Multi-Agent Environments," IEEE Access, vol. 9, pp. 96641-96657, 2021-01-01. DOI: 10.1109/ACCESS.2021.3094623. Article number: 9474507. ISSN: 2169-3536. Language: English. Publisher: IEEE. Online access: https://ieeexplore.ieee.org/document/9474507/. Subjects: Contextual bandit learning; reward oracles; sampling guidance from expert-designed policies; short-term and long-term rewards. Record: doaj.art-fcb5205821ea46c889fa20c5bc1b4e47 (updated 2022-12-21T22:25:57Z). |
title | Contextual Bandit Learning With Reward Oracles and Sampling Guidance in Multi-Agent Environments |
topic | Contextual bandit learning; reward oracles; sampling guidance from expert-designed policies; short-term and long-term rewards |
url | https://ieeexplore.ieee.org/document/9474507/ |