Bi-factorial preference optimization: balancing safety-helpfulness in language models
Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential...
| Main Authors: | Zhang, W, Torr, PHS, Elhoseiny, M, Bibi, A |
|---|---|
| Format: | Conference item |
| Language: | English |
| Published: | OpenReview, 2025 |
Similar Items
- Model merging and safety alignment: one bad model spoils the bunch
  by: Hammoud, HAAK, et al.
  Published: (2024)
- Universal in-context approximation by prompting fully recurrent models
  by: Petrov, A, et al.
  Published: (2025)
- Professional Attachment Report [with] Nanjing Spark plug factory (subsidiary of the Nanjing electroceramics factory)
  by: Moy, Loh Cheak
  Published: (2009)
- Preference-Conditioned Language-Guided Abstraction
  by: Peng, Andi, et al.
  Published: (2024)
- Decoding stakeholder priorities of safety culture preferences in the oil and gas industry
  by: Rahim, Hafiz, et al.
  Published: (2024)