Bi-factorial preference optimization: balancing safety-helpfulness in language models

Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential conflicts between safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective over both safety and helpfulness into a single supervised learning objective. In the supervised optimization, a labeling function is used to capture a global preference ranking that balances safety and helpfulness. To evaluate BFPO, we develop a benchmark comprising comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO eliminates the need for human prompting and annotation in LLM fine-tuning while achieving the same level of safety as methods that rely heavily on human labor, with less than 10% of the computational resources. The training recipes and models will be released.
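The abstract describes re-parameterizing a joint safety-helpfulness RLHF objective into a single supervised objective, driven by a labeling function that captures a global preference ranking. The sketch below is a minimal illustration of that general idea, not the paper's actual BFPO objective: it assumes a DPO-style logistic loss on an implicit reward margin, together with a hypothetical global_preference_label function that mixes helpfulness scores with binary safety flags. Every function name, weight, and hyperparameter here is an assumption made for illustration.

# Illustrative sketch only: a DPO-style supervised preference loss in which a
# hypothetical labeling function mixes helpfulness and safety signals into a
# single global ranking. This is NOT the BFPO objective from the paper; the
# labeling rule, weights, and hyperparameters below are assumptions.

import torch
import torch.nn.functional as F


def global_preference_label(help_a, help_b, safe_a, safe_b, safety_weight=1.0):
    """Hypothetical labeling function: rank response A against response B by a
    weighted sum of a helpfulness score and a binary safety flag.
    Returns +1 where A is globally preferred, -1 where B is preferred."""
    score_a = help_a + safety_weight * safe_a
    score_b = help_b + safety_weight * safe_b
    return torch.where(score_a >= score_b,
                       torch.ones_like(score_a),
                       -torch.ones_like(score_a))


def supervised_preference_loss(policy_logp_a, policy_logp_b,
                               ref_logp_a, ref_logp_b,
                               label, beta=0.1):
    """DPO-like logistic loss on the implicit reward margin between responses
    A and B; the sign of the margin is set by the global preference label."""
    margin = (policy_logp_a - ref_logp_a) - (policy_logp_b - ref_logp_b)
    return -F.logsigmoid(beta * label * margin).mean()


if __name__ == "__main__":
    # Toy batch: summed log-probabilities of two candidate responses under the
    # policy being trained and a frozen reference model.
    policy_logp_a = torch.tensor([-12.0, -15.0])
    policy_logp_b = torch.tensor([-11.0, -18.0])
    ref_logp_a = torch.tensor([-13.0, -16.0])
    ref_logp_b = torch.tensor([-12.0, -17.0])

    # Helpfulness scores and binary safety flags for each response.
    label = global_preference_label(
        help_a=torch.tensor([0.8, 0.4]),
        help_b=torch.tensor([0.9, 0.7]),
        safe_a=torch.tensor([1.0, 1.0]),
        safe_b=torch.tensor([0.0, 1.0]),
    )
    print(supervised_preference_loss(policy_logp_a, policy_logp_b,
                                     ref_logp_a, ref_logp_b, label))

In this toy form, safety_weight is the single knob trading harmlessness against helpfulness; how the actual BFPO re-parameterization balances the two factors is specified in the paper itself.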

Bibliographic Details
Main Authors: Zhang, W; Torr, PHS; Elhoseiny, M; Bibi, A
Format: Conference item
Language: English
Published: OpenReview, 2025
Institution: University of Oxford
Record ID: oxford-uuid:71b206f4-06f8-41ec-b74a-ab7e70753069