Summary: | This paper proposes a model-estimation method for offline Bayesian model-based reinforcement learning (MBRL). When a Bayes-adaptive Markov decision process (BAMDP) model is learned with standard variational inference, its predictive performance often degrades because of covariate shift between the offline data distribution and the data distribution encountered at deployment. To address this, the paper applies an importance-weighting correction for covariate shift to the variational learning of the BAMDP model. This yields a single unified objective that is optimized jointly over the model and the policy: from the model side it is an importance-weighted variational objective, and from the policy side it is the expected return penalized by the model error, which is a standard objective in MBRL. The paper proposes an algorithm that optimizes this unified objective, and numerical experiments show that it outperforms baselines trained with standard (unweighted) variational inference.
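Schematically, an importance-weighted variational objective of the kind described above can be written as (this is my own sketch for concreteness, not the paper's exact formulation; $w$, $q_\phi$, $p_\theta$, and $z$ are my notation)
$$
\mathcal{L}(\theta,\phi) \;=\; \mathbb{E}_{(s,a,s')\sim \mathcal{D}}\!\left[ w(s,a)\Big( \mathbb{E}_{q_\phi(z\mid s,a,s')}\big[\log p_\theta(s'\mid s,a,z)\big] \;-\; \mathrm{KL}\big(q_\phi(z\mid s,a,s')\,\|\,p(z)\big) \Big) \right],
$$
where $w(s,a)\approx d^{\pi}(s,a)/d^{\mathcal{D}}(s,a)$ re-weights the offline transitions toward the state-action distribution induced by the learned policy, thereby correcting for the covariate shift in the per-transition ELBO terms.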