Contextual Bandit Learning With Reward Oracles and Sampling Guidance in Multi-Agent Environments

Bibliographic Details
Main Authors: Mike Li, Quang Dang Nguyen
Format: Article
Language: English
Published: IEEE, 2021-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/9474507/
Description
Summary: Learning an action policy for autonomous agents in a decentralized multi-agent environment has remained an interesting but difficult research problem. We propose to model this problem in a contextual bandit setting with delayed reward signals, in particular an individual short-term reward signal and a shared long-term reward signal. Our algorithm uses reward oracles to directly model these delayed reward signals and relies on a learning scheme that benefits from the sampling guidance of an expert-designed policy. The algorithm is expected to apply to a wide range of problems, including those with constraints on accessing state transitions and those with implicit reward information. A demonstration, implemented with deep learning regressors, shows the effectiveness of the proposed algorithm, relative to a baseline policy, in learning an offensive action policy in the RoboCup Soccer 2D Simulation (RCSS) environment against a well-known adversary benchmark team.
ISSN: 2169-3536
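
The summary describes the method only at a high level. As a rough illustration of the ingredients it names (a contextual bandit, regressor-based oracles for the two delayed reward signals, and sampling guidance from an expert-designed policy), here is a minimal Python sketch. It is not the authors' implementation: the class name OracleGuidedBandit, the epsilon mixing rate, the long_term_weight used to combine the two oracle estimates, and the substitution of scikit-learn's MLPRegressor for the paper's deep learning regressors are all illustrative assumptions.

import numpy as np
from sklearn.neural_network import MLPRegressor

class OracleGuidedBandit:
    """Contextual bandit with two learned reward oracles and expert-guided sampling.

    Illustrative sketch only; the structure and hyperparameters are assumptions,
    not details taken from the article.
    """

    def __init__(self, n_actions, epsilon=0.2, long_term_weight=0.5):
        self.n_actions = n_actions
        self.epsilon = epsilon              # probability of deferring to the expert policy
        self.long_term_weight = long_term_weight
        # One regressor per delayed reward signal; each maps (context, action)
        # features to a predicted reward, acting as a learned reward oracle.
        self.short_oracle = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
        self.long_oracle = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
        self.buffer = []                    # (context, action, r_short, r_long)

    def _features(self, context, action):
        # Concatenate the context with a one-hot encoding of the action.
        onehot = np.zeros(self.n_actions)
        onehot[action] = 1.0
        return np.concatenate([np.asarray(context, dtype=float), onehot])

    def act(self, context, expert_policy, rng):
        # Sampling guidance: early on, and with probability epsilon thereafter,
        # follow the expert-designed policy instead of the greedy choice.
        if len(self.buffer) < self.n_actions or rng.random() < self.epsilon:
            return expert_policy(context)
        # Otherwise act greedily on a weighted combination of the oracle estimates.
        scores = [
            self.short_oracle.predict([self._features(context, a)])[0]
            + self.long_term_weight * self.long_oracle.predict([self._features(context, a)])[0]
            for a in range(self.n_actions)
        ]
        return int(np.argmax(scores))

    def observe(self, context, action, r_short, r_long):
        # Record the delayed rewards once they arrive and refit both oracles.
        # Refitting from scratch each step is wasteful but keeps the sketch simple.
        self.buffer.append((context, action, r_short, r_long))
        X = np.array([self._features(c, a) for c, a, _, _ in self.buffer])
        self.short_oracle.fit(X, np.array([r for _, _, r, _ in self.buffer]))
        self.long_oracle.fit(X, np.array([r for _, _, _, r in self.buffer]))

A possible usage pattern, with a trivial callable standing in for the hand-designed expert policy:

rng = np.random.default_rng(0)
bandit = OracleGuidedBandit(n_actions=4)
context = rng.normal(size=8)
action = bandit.act(context, expert_policy=lambda c: 0, rng=rng)
bandit.observe(context, action, r_short=1.0, r_long=0.3)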