Summary: | Reinforcement learning (RL) has gained increasing interest in recent years, with the expectation that it will deliver autonomous agents that learn to interact with an environment. So far, its empirical successes rely heavily on enormous amounts of data collected during interaction, and are therefore mostly limited to domains with exact simulators, such as video games and board games. Even with simulators, RL's demand for data can be a heavy burden, and in physical-world domains such as robotics and self-driving, collecting data through interaction is usually costly in both money and time. Consequently, provably efficient methods are highly valued.
The central challenge in achieving data efficiency is balancing the trade-off between exploration and exploitation. The principle of optimism in the face of uncertainty, originating in the bandit literature, has been shown to achieve minimax optimality for Markov decision processes (MDPs) in the tabular case. However, the tabular model may be too general for problems whose structure allows much more efficient learning. In the first part of this thesis, we consider the factored MDP model, in which the transition and reward functions factorize over components of the state. Assuming the factorization is known, we propose two optimism-based algorithms under the same model-based framework: one achieves minimax-optimal regret guarantees for a rich class of factored structures, while the other enjoys better computational complexity at the cost of a slightly worse regret bound. We complement our algorithms with structure-dependent regret lower bounds for factored MDPs, which reveal how the intricacy of the structures makes a full characterization of the lower bounds difficult.
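To make the factored structure concrete, the sketch below shows one common way such a model is formalized: the state is a tuple of components, and each component's next value depends only on a small "scope" of components (plus the action), so the number of parameters scales with the scope sizes rather than with the full state space. The names, scopes, and toy dimensions here are illustrative assumptions, not the notation or algorithms of the thesis.

```python
import numpy as np

# Toy factored MDP: the state is a tuple of m components, and the next value
# of component i depends only on the components in scopes[i] and the action.
# All names and sizes below are illustrative, not the thesis's notation.

m = 3                          # number of state components
component_sizes = [2, 2, 3]    # domain size of each component
num_actions = 2
scopes = [[0], [0, 1], [2]]    # component i depends only on scopes[i]

rng = np.random.default_rng(0)

def random_factor(scope, size):
    """Random conditional distribution P_i(x_i' | x[scope], a)."""
    shape = tuple(component_sizes[j] for j in scope) + (num_actions, size)
    table = rng.random(shape)
    return table / table.sum(axis=-1, keepdims=True)

factors = [random_factor(scopes[i], component_sizes[i]) for i in range(m)]

def step(state, action):
    """Sample the next state component-wise from the factored transition."""
    next_state = []
    for i in range(m):
        idx = tuple(state[j] for j in scopes[i]) + (action,)
        probs = factors[i][idx]                       # distribution over x_i'
        next_state.append(int(rng.choice(component_sizes[i], p=probs)))
    return tuple(next_state)

print(step((0, 1, 2), action=1))
```

In this toy example, the full state space has 2 x 2 x 3 = 12 states, but the transition model is described by three small conditional tables; the optimism-based algorithms in the thesis exploit exactly this kind of compactness to learn with far fewer samples than a tabular method would need.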
Another limitation of the MDP model is that it does not capture learning in a multi-agent environment. In the second part of this thesis, we study online learning in unknown Markov games, a problem that arises in multi-agent RL when the opponents' actions are unobservable. We show that in this challenging setting, achieving sublinear regret against the best response in hindsight is statistically hard. We therefore consider a weaker notion of regret, competing with the minimax value of the game, and present an algorithm that achieves Õ(K^{2/3}) regret after K episodes. To our knowledge, this is the first sublinear regret bound for online learning in unknown Markov games. Importantly, the regret bound is independent of the size of the opponents' action spaces. Even when the opponents' actions are fully observable, our regret bound improves upon existing analyses by a factor exponential in the number of opponents.
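The precise definitions appear in the thesis body; as a rough, notation-illustrative formalization, the weaker regret notion compares the learner's per-episode value against the minimax value of the game rather than against the best response in hindsight (here V* denotes the minimax value and V^{π_k, ν_k} the value of the learner's policy π_k against the opponents' unobserved policies ν_k in episode k, both assumed symbols):

```latex
% Schematic regret against the minimax value (notation illustrative):
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \left( V^{*} - V^{\pi_k,\,\nu_k} \right),
\qquad \text{with the guarantee}\quad
\mathrm{Regret}(K) \;=\; \widetilde{\mathcal{O}}\!\left(K^{2/3}\right).
```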
|