Learning mirror maps in policy mirror descent

Policy Mirror Descent (PMD) is a popular framework in reinforcement learning, serving as a unifying perspective that encompasses numerous algorithms. These algorithms are derived through the selection of a mirror map and enjoy finite-time convergence guarantees. Despite its popularity, the exploration of PMD’s full potential is limited, with the majority of research focusing on a particular mirror map—namely, the negative entropy—which gives rise to the renowned Natural Policy Gradient (NPG) method. It remains uncertain from existing theoretical studies whether the choice of mirror map significantly influences PMD’s efficacy. In our work, we conduct empirical investigations to show that the conventional mirror map choice (NPG) often yields less-than-optimal outcomes across several standard benchmark environments. Using evolutionary strategies, we identify more efficient mirror maps that enhance the performance of PMD. We first focus on a tabular environment, i.e. Grid-World, where we relate existing theoretical bounds with the performance of PMD for a few standard mirror maps and the learned one. We then show that it is possible to learn a mirror map that outperforms the negative entropy in more complex environments, such as the MinAtar suite. Additionally, we demonstrate that the learned mirror maps generalize effectively to different tasks by testing each map across various other environments.
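
For context, the PMD update the abstract refers to can be sketched as follows; this is the standard formulation (mirror map $h$, step size $\eta$, action-value function $Q^{\pi_t}$, Bregman divergence $D_h$), supplied here for orientation rather than taken from the record itself:

\[
\pi_{t+1}(\cdot \mid s) \in \arg\max_{p \in \Delta(\mathcal{A})} \Big\{ \eta \, \big\langle Q^{\pi_t}(s, \cdot),\, p \big\rangle - D_h\big(p,\ \pi_t(\cdot \mid s)\big) \Big\} \quad \text{for every state } s.
\]

Taking $h$ to be the negative entropy makes $D_h$ the KL divergence, and the resulting closed-form update, $\pi_{t+1}(\cdot \mid s) \propto \pi_t(\cdot \mid s)\, \exp\!\big(\eta\, Q^{\pi_t}(s, \cdot)\big)$, is the Natural Policy Gradient (NPG) method mentioned in the abstract.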

Full Details

Bibliographic Details
Main Authors: Alfano, C; Towers, S; Sapora, S; Lu, C; Rebeschini, P
Format: Conference item
Language: English
Published: International Conference on Learning Representations, 2025
Collection: OXFORD
Record ID: oxford-uuid:a150dbcf-4e5a-4b31-9741-9a7ccda5a7a5
Institution: University of Oxford