What makes and breaks safety fine-tuning? A mechanistic study

Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment. To better understand the underlying factors that make models safe via safety fine-tuning, we design a synthetic data generation framework that captures salient aspects of an unsafe input by...

Bibliographic Details
Main Authors: Jain, S, Lubana, ES, Oksuz, K, Joy, T, Sanyal, A, Torr, P, Dokania, PK
Format: Conference item
Language: English
Published: OpenReview 2024