Summary: | This paper introduces a novel method of adding intrinsic bonuses to task-oriented reward function in order to efficiently facilitate reinforcement learning search. While various bonuses have been designed to date, this paper points out that the intrinsic bonuses can be analogous to either of the depth-first or breadth-first search algorithms in graph theory. Since these two have their own strengths and weaknesses, new bonuses that can bring out the respective characteristics are first proposed. The designed bonuses are derived as differences in key indicators that converge in the steady state based on the concepts of value disagreement and self-imitation. Then, a heuristic gain scheduling is applied to the designed bonuses, inspired by the iterative deepening search, which is known to inherit the advantages of the two search algorithms. The proposed method is expected to allow agent to efficiently reach the best solution in deeper states by gradually exploring unknown states. In three locomotion tasks with dense rewards and three simple tasks with sparse rewards, it is shown that the two types of bonus contribute to the performance improvement of the different tasks complementarily. In addition, by combining them with the proposed gain scheduling, all tasks can be accomplished with high performance.
|