Bi-factorial preference optimization: balancing safety-helpfulness in language models

Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential conflicts between safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective over both safety and helpfulness into a single supervised learning objective. In the supervised optimization, a labeling function is used to capture a global preference ranking that balances safety and helpfulness. To evaluate BFPO, we develop a benchmark comprising comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO eliminates the need for human prompting and annotation in LLM fine-tuning while achieving the same level of safety as methods that rely heavily on human labor, with less than 10% of the computational resources. The training recipes and models will be released.
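The abstract describes re-parameterizing a joint safety-helpfulness RLHF objective into a single supervised objective, driven by a labeling function that captures a global preference ranking. The sketch below is a minimal illustration of that general idea, not the paper's actual BFPO objective: it assumes a DPO-style logistic loss on an implicit reward margin, together with a hypothetical global_preference_label function that mixes helpfulness scores with binary safety flags. Every function name, weight, and hyperparameter here is an assumption made for illustration.

# Illustrative sketch only: a DPO-style supervised preference loss in which a
# hypothetical labeling function mixes helpfulness and safety signals into a
# single global ranking. This is NOT the BFPO objective from the paper; the
# labeling rule, weights, and hyperparameters below are assumptions.

import torch
import torch.nn.functional as F


def global_preference_label(help_a, help_b, safe_a, safe_b, safety_weight=1.0):
    """Hypothetical labeling function: rank response A against response B by a
    weighted sum of a helpfulness score and a binary safety flag.
    Returns +1 where A is globally preferred, -1 where B is preferred."""
    score_a = help_a + safety_weight * safe_a
    score_b = help_b + safety_weight * safe_b
    return torch.where(score_a >= score_b,
                       torch.ones_like(score_a),
                       -torch.ones_like(score_a))


def supervised_preference_loss(policy_logp_a, policy_logp_b,
                               ref_logp_a, ref_logp_b,
                               label, beta=0.1):
    """DPO-like logistic loss on the implicit reward margin between responses
    A and B; the sign of the margin is set by the global preference label."""
    margin = (policy_logp_a - ref_logp_a) - (policy_logp_b - ref_logp_b)
    return -F.logsigmoid(beta * label * margin).mean()


if __name__ == "__main__":
    # Toy batch: summed log-probabilities of two candidate responses under the
    # policy being trained and a frozen reference model.
    policy_logp_a = torch.tensor([-12.0, -15.0])
    policy_logp_b = torch.tensor([-11.0, -18.0])
    ref_logp_a = torch.tensor([-13.0, -16.0])
    ref_logp_b = torch.tensor([-12.0, -17.0])

    # Helpfulness scores and binary safety flags for each response.
    label = global_preference_label(
        help_a=torch.tensor([0.8, 0.4]),
        help_b=torch.tensor([0.9, 0.7]),
        safe_a=torch.tensor([1.0, 1.0]),
        safe_b=torch.tensor([0.0, 1.0]),
    )
    print(supervised_preference_loss(policy_logp_a, policy_logp_b,
                                     ref_logp_a, ref_logp_b, label))

In this toy form, safety_weight is the single knob trading harmlessness against helpfulness; how the actual BFPO re-parameterization balances the two factors is specified in the paper itself.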

Bibliographic Details
Main Authors: Zhang, W; Torr, PHS; Elhoseiny, M; Bibi, A
Format: Conference item
Language: English
Published: OpenReview, 2025
Institution: University of Oxford
Record ID: oxford-uuid:71b206f4-06f8-41ec-b74a-ab7e70753069