What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by...
| Main Authors: | , , , , , , |
| --- | --- |
| Format: | Conference item |
| Language: | English |
| Published: | OpenReview, 2024 |