Fingerprint policy optimisation for robust reinforcement learning
Policy gradient methods ignore the potential value of adjusting environment variables: unobservable state features that are randomly determined by the environment in a physical setting, but are controllable in a simulator. This can lead to slow learning, or convergence to suboptimal policies, if the...
Main Authors: | , , |
---|---|
Format: | Conference item |
Published: | Journal of Machine Learning Research, 2019 |