Summary: | We theoretically explore the relationship between sample-efficiency and adaptivity in reinforcement learning. An algorithm is sample-efficient if it uses a number
of queries n to the environment that is polynomial in the dimension d of the problem. Adaptivity refers to the frequency at which queries are sent and feedback is
processed to update the querying strategy. To investigate this interplay, we employ
a learning framework that allows sending queries in K batches, with feedback being processed and queries updated after each batch. This model encompasses the
whole adaptivity spectrum, ranging from non-adaptive ‘offline’ (K = 1) to fully
adaptive (K = n) scenarios, and regimes in between. For the problems of policy
evaluation and best-policy identification under d-dimensional linear function approximation, we establish Ω(log log d) lower bounds on the number of batches K
required for sample-efficient algorithms with n = O(poly(d)) queries. Our results
show that merely having adaptivity (K > 1) does not necessarily guarantee sample-efficiency. Notably, the adaptivity boundary for sample-efficiency does not lie between
offline reinforcement learning (K = 1), where sample-efficient learning is known to
be impossible, and adaptive settings. Instead, the boundary lies between different
regimes of adaptivity and depends on the problem dimension.
|