Bi-factorial preference optimization: balancing safety-helpfulness in language models

Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential...

Bibliographic Details
Main Authors: Zhang, W, Torr, PHS, Elhoseiny, M, Bibi, A
Format: Conference item
Language: English
Published: OpenReview 2025