Fingerprint policy optimisation for robust reinforcement learning

Policy gradient methods ignore the potential value of adjusting environment variables: unobservable state features that are randomly determined by the environment in a physical setting, but are controllable in a simulator. This can lead to slow learning, or convergence to suboptimal policies, if the...

Bibliographic Details
Main Authors: Paul, S; Osborne, M; Whiteson, S
Format: Conference item
Published: Journal of Machine Learning Research 2019