Contextual Bandit Learning With Reward Oracles and Sampling Guidance in Multi-Agent Environments
Learning an action policy for autonomous agents in a decentralized multi-agent environment remains an interesting but difficult research problem. We propose to model this problem as a contextual bandit with delayed reward signals, specifically an individual short-term reward signal and a shared long-term reward signal. Our algorithm uses reward oracles to model these delayed reward signals directly and relies on a learning scheme that benefits from the sampling guidance of an expert-designed policy. The algorithm is expected to apply to a wide range of problems, including those with constraints on accessing state transitions and those with implicit reward information. A demonstration, implemented with deep learning regressors, shows the effectiveness of the proposed algorithm in learning an offensive action policy in the RoboCup Soccer 2D Simulation (RCSS) environment against a well-known adversary benchmark team, compared with a baseline policy.
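For illustration only, below is a minimal sketch of the setting the abstract describes: a contextual bandit learner that fits two reward oracles (regressors predicting the individual short-term reward and the shared long-term reward from context-action pairs) and that draws part of its training samples from an expert-designed policy. All names and parameters here (`GuidedBanditLearner`, `guidance`, `mix`, the feature encoding, the fixed guidance probability) are assumptions made for the sketch, not the authors' implementation; the paper's demonstration uses deep learning regressors, whereas scikit-learn's `GradientBoostingRegressor` is substituted purely for brevity.

```python
# Illustrative sketch only: contextual bandit learning with two reward
# oracles and sampling guidance from an expert-designed policy.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


class GuidedBanditLearner:
    def __init__(self, n_actions, expert_policy, guidance=0.5, mix=0.5):
        self.n_actions = n_actions
        self.expert_policy = expert_policy      # expert(context) -> action index
        self.guidance = guidance                # prob. of sampling the expert action
        self.mix = mix                          # weight of short- vs. long-term estimate
        self.short_oracle = GradientBoostingRegressor()   # individual short-term reward
        self.long_oracle = GradientBoostingRegressor()    # shared long-term reward
        self._X, self._r_short, self._r_long = [], [], []
        self._fitted = False

    def _features(self, context, action):
        # Simple (context, one-hot action) feature encoding.
        one_hot = np.zeros(self.n_actions)
        one_hot[action] = 1.0
        return np.concatenate([context, one_hot])

    def select_action(self, context, rng):
        # Sampling guidance: defer to the expert policy part of the time,
        # otherwise act greedily on the combined oracle estimate.
        if not self._fitted or rng.random() < self.guidance:
            return self.expert_policy(context)
        feats = np.stack([self._features(context, a) for a in range(self.n_actions)])
        score = (self.mix * self.short_oracle.predict(feats)
                 + (1.0 - self.mix) * self.long_oracle.predict(feats))
        return int(np.argmax(score))

    def record(self, context, action, short_reward, long_reward):
        # Delayed rewards are attributed back to the (context, action) pair
        # once they become available.
        self._X.append(self._features(context, action))
        self._r_short.append(short_reward)
        self._r_long.append(long_reward)

    def refit(self):
        # Refit both reward oracles on all observed samples.
        X = np.stack(self._X)
        self.short_oracle.fit(X, np.array(self._r_short))
        self.long_oracle.fit(X, np.array(self._r_long))
        self._fitted = True
```

In use, one might create `rng = np.random.default_rng(0)`, alternate `select_action` and `record` while interacting with the environment, and call `refit` between episodes; annealing `guidance` toward zero over training would shift sampling from the expert-designed policy toward the learned oracles.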
Main Authors: | Mike Li, Quang Dang Nguyen
---|---
Format: | Article
Language: | English
Published: | IEEE, 2021-01-01
Series: | IEEE Access
Subjects: | Contextual bandit learning; reward oracles; sampling guidance from expert-designed policies; short-term and long-term rewards
Online Access: | https://ieeexplore.ieee.org/document/9474507/
_version_ | 1818612185158385664 |
---|---|
author | Mike Li; Quang Dang Nguyen |
collection | DOAJ |
description | Learning an action policy for autonomous agents in a decentralized multi-agent environment remains an interesting but difficult research problem. We propose to model this problem as a contextual bandit with delayed reward signals, specifically an individual short-term reward signal and a shared long-term reward signal. Our algorithm uses reward oracles to model these delayed reward signals directly and relies on a learning scheme that benefits from the sampling guidance of an expert-designed policy. The algorithm is expected to apply to a wide range of problems, including those with constraints on accessing state transitions and those with implicit reward information. A demonstration, implemented with deep learning regressors, shows the effectiveness of the proposed algorithm in learning an offensive action policy in the RoboCup Soccer 2D Simulation (RCSS) environment against a well-known adversary benchmark team, compared with a baseline policy. |
first_indexed | 2024-12-16T15:42:12Z |
format | Article |
id | doaj.art-fcb5205821ea46c889fa20c5bc1b4e47 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-16T15:42:12Z |
publishDate | 2021-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | Mike Li (ORCID: https://orcid.org/0000-0003-4514-7260) and Quang Dang Nguyen (ORCID: https://orcid.org/0000-0002-0403-6903), both at the Centre for Complex Systems, Faculty of Engineering, University of Sydney, Sydney, NSW, Australia. "Contextual Bandit Learning With Reward Oracles and Sampling Guidance in Multi-Agent Environments," IEEE Access, vol. 9, pp. 96641-96657, 2021-01-01. DOI: 10.1109/ACCESS.2021.3094623. Article number: 9474507. ISSN: 2169-3536. Language: English. Publisher: IEEE. Online access: https://ieeexplore.ieee.org/document/9474507/. Subjects: Contextual bandit learning; reward oracles; sampling guidance from expert-designed policies; short-term and long-term rewards. Record: doaj.art-fcb5205821ea46c889fa20c5bc1b4e47 (updated 2022-12-21T22:25:57Z). |
title | Contextual Bandit Learning With Reward Oracles and Sampling Guidance in Multi-Agent Environments |
topic | Contextual bandit learning; reward oracles; sampling guidance from expert-designed policies; short-term and long-term rewards |
url | https://ieeexplore.ieee.org/document/9474507/ |