Bi-factorial preference optimization: balancing safety-helpfulness in language models
Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential...
| Main Authors: | |
|---|---|
| Format: | Conference item |
| Language: | English |
| Published: | OpenReview, 2025 |