Steerable Alignment with Conditional Multiobjective Preference Optimization

As the scale, capabilities, and use cases of large language models (LLMs) continue to grow, it is imperative that these systems are aligned with human preferences. Current state-of-the-art alignment strategies, such as Reinforcement Learning from Human Feedback (RLHF), have provided useful paradigms for finetuning LLMs to produce outputs that are more consistent with human preferences. These approaches, however, assume that preferences are formed by a single, underlying reward model, which is likely insufficient for representing an individual’s preferences, certainly unable to represent diverse group preferences, and inflexible for users at inference time. To address these limitations, we propose Conditional Multiobjective Preference Optimization (CMPO), a novel alignment strategy that trains a user-steerable LLM along multiple attributes of text, such as helpfulness and humor. CMPO simulates the Pareto front of multiple single-attribute preference-optimized models through structural plurality and finetuning with Direct Preference Optimization (DPO), and allows users to condition outputs on the predefined attributes at inference time. Experiments show that CMPO generates responses that are preferred to those from separate attribute-specific DPO models and from models trained using SteerLM, an alternative model-steering approach. CMPO empirically shows promise as a scalable and flexible finetuning strategy for creating LLMs that are attribute-steerable from parameterized preferences.
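
For context on the techniques named in the abstract, the sketch below illustrates one way a conditional, attribute-steerable preference objective can be set up: the standard DPO loss computed from policy and reference log-probabilities, paired with a hypothetical prompt-prefix scheme that encodes per-attribute weights such as helpfulness and humor. The function names, the control-token format, and the prefix-based conditioning are illustrative assumptions only; the abstract describes CMPO's mechanism only as structural plurality combined with DPO finetuning, and the thesis itself should be consulted for the actual formulation.

# Illustrative sketch only: not the thesis's actual CMPO implementation.
# It shows (1) the standard DPO loss computed from policy/reference log-probs and
# (2) a hypothetical prompt-prefix scheme for conditioning generation on
# per-attribute weights (e.g. helpfulness, humor).
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over (chosen, rejected) response pairs.

    Attribute conditioning is assumed to happen upstream: the log-probabilities
    are computed for prompts that already encode the desired attribute weights.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the implicit rewards of chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


def attribute_conditioned_prompt(prompt: str, weights: dict[str, float]) -> str:
    """Hypothetical conditioning scheme: prepend attribute weights as control tokens
    so a single model can be steered along several objectives at inference time."""
    header = " ".join(f"<{name}={value:.1f}>" for name, value in weights.items())
    return f"{header} {prompt}"


if __name__ == "__main__":
    # Steer toward a maximally helpful, mildly humorous response.
    print(attribute_conditioned_prompt("Explain RLHF in two sentences.",
                                       {"helpfulness": 1.0, "humor": 0.3}))
    # Toy loss computation on dummy log-probabilities.
    logps = [torch.tensor([-12.0]), torch.tensor([-15.0]),
             torch.tensor([-13.0]), torch.tensor([-14.0])]
    print(dpo_loss(*logps).item())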

Bibliographic Details
Main Author: Manyika, Julian
Other Authors: Hadfield-Menell, Dylan
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Format: Thesis
Published: Massachusetts Institute of Technology, 2024
License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); copyright retained by author(s)
Online Access: https://hdl.handle.net/1721.1/156747