What makes and breaks safety fine-tuning? A mechanistic study

Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by...

Bibliographic Details
Main Authors: Jain, S, Lubana, ES, Oksuz, K, Joy, T, Sanyal, A, Torr, P, Dokania, PK
Format: Conference item
Language: English
Published: OpenReview 2024