Online learning with sample path constraints


Bibliographic Details
Main Authors: Mannor, Shie, Tsitsiklis, John N., Yu, Jia Yuan
Other Authors: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Format: Article
Language: en_US
Published: MIT Press 2010
Online Access: http://hdl.handle.net/1721.1/51700
https://orcid.org/0000-0003-2658-8239
Description
Summary: We study online learning where a decision maker interacts with Nature with the objective of maximizing her long-term average reward subject to some sample path average constraints. We define the reward-in-hindsight as the highest reward the decision maker could have achieved, while satisfying the constraints, had she known Nature's choices in advance. We show that in general the reward-in-hindsight is not attainable. The convex hull of the reward-in-hindsight function is, however, attainable. For the important case of a single constraint, the convex hull turns out to be the highest attainable function. Using a calibrated forecasting rule, we provide an explicit strategy that attains this convex hull. We also measure the performance of heuristic methods based on non-calibrated forecasters in experiments involving a CPU power management problem.
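To make the notion of reward-in-hindsight concrete, the following is a minimal sketch (not the paper's method) that computes it for a toy two-action, two-state game: given Nature's realized sequence, it searches over stationary mixed actions for the highest average reward whose average constraint cost stays within a budget. All payoff matrices, the budget, and the grid search are invented for illustration.

```python
import numpy as np

# Toy illustration of "reward-in-hindsight": the best average reward a
# stationary mixed action could have earned against Nature's empirical
# state frequencies, subject to an average-cost constraint.
# Payoffs, costs, and the budget below are hypothetical.

rng = np.random.default_rng(0)
T = 1000
nature = rng.integers(0, 2, size=T)            # Nature's choices (2 states)

# reward[a, s] and cost[a, s] for action a in {0, 1}, state s in {0, 1}
reward = np.array([[1.0, 0.0],
                   [0.2, 0.9]])
cost = np.array([[0.8, 0.7],
                 [0.1, 0.2]])
budget = 0.5                                    # sample-path average constraint

freq = np.bincount(nature, minlength=2) / T     # empirical state frequencies

best = -np.inf
for p in np.linspace(0.0, 1.0, 101):            # play action 0 with prob. p
    avg_r = p * reward[0] @ freq + (1 - p) * reward[1] @ freq
    avg_c = p * cost[0] @ freq + (1 - p) * cost[1] @ freq
    if avg_c <= budget:                         # feasible mixture
        best = max(best, avg_r)

print(f"reward-in-hindsight over stationary mixtures: {best:.3f}")
```

Because the reward-in-hindsight depends on Nature's sequence only through its empirical frequencies, this grid search over mixtures is enough for the toy case; the paper's point is that an online strategy cannot in general attain this benchmark, only its convex hull.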