Summary: | <p>Developing agents that behave intelligently in the world is an open challenge in
machine learning. Desiderata for such agents include efficient exploration, maximizing
long-term utility, and the ability to effectively leverage prior data to solve new
tasks. Reinforcement learning (RL) is an approach that is predicated on learning
by directly interacting with an environment through trial-and-error, and presents
a way for us to train and deploy such agents. Moreover, combining RL with
powerful neural network function approximators – a sub-field known as “deep RL” –
has shown evidence towards achieving this goal. For instance, deep RL has yielded
agents that can play Go at superhuman levels, improve the efficiency of microchip
designs, and learn complex novel strategies for controlling nuclear fusion reactions.</p>
<p>A key issue that stands in the way of deploying deep RL is poor sample efficiency. Concretely, while it is possible to train effective agents using deep
RL, the key successes have largely been in environments where we have access to
large amounts of online interaction, often through the use of simulators. However,
in many real-world problems, we are confronted with scenarios where samples
are expensive to obtain. As has been alluded to, one way to alleviate this issue
is through accessing prior data, often termed “offline data”, which can
accelerate agent learning: for example, exploratory data can prevent redundant
deployments, and human-expert data can quickly guide agents towards promising
behaviors. However, the best way to
incorporate this data into existing deep RL algorithms is not straightforward;
naïvely pre-training with RL algorithms on this offline data (a paradigm called
“offline RL”) as a starting point for subsequent learning is often detrimental.
Moreover, it is unclear how to explicitly derive useful behaviors online that are
positively influenced by this offline pre-training.</p>
<p>With these factors in mind, this thesis follows a three-pronged strategy towards
improving sample efficiency in deep RL. First, we investigate effective pre-training
on offline data. Then, we tackle the online problem, looking at efficient adaptation
to environments when operating purely online. Finally, we conclude with hybrid
strategies that use offline data to explicitly augment policies when acting online.</p>
|