Summary: The alignment of large language models (LLMs) to human values has become increasingly pressing as their scale and capabilities have grown. One important aspect of alignment is understanding the preference datasets used to fine-tune LLMs. Inverse Constitutional AI (ICAI) is presented as a novel interpretability framework for discovering the principles underlying preference datasets. Motivated by the Constitutional AI training paradigm, which instills explicit principles in models, ICAI aims to extract a succinct "constitution" of natural-language principles from data. This thesis contributes an initial attempt at realizing ICAI through a clustering-based methodology applied to preference datasets. The proposed approach embeds preference pairs into vector representations, clusters the embeddings to group related preferences, generates an interpretable principle for each cluster using language models, and validates these principles against held-out samples. Empirical evaluation is conducted on the hh-rlhf dataset for training helpful and harmless AI assistants, as well as on a synthetic dataset constructed by relabeling hh-rlhf samples according to predefined principles. The results demonstrate promising capabilities in clustering semantically coherent topics and generating human-interpretable principles, while also highlighting limitations in achieving fully disentangled, principle-based clustering. Directions for future work are discussed, including soft clustering, bottom-up principle extraction, prompt optimization, and sparse dictionary learning.

In this work, I argue the following thesis: ICAI shows promise as a strategy for disentangling and explaining the preferences represented in preference data. A clustering-based approach to ICAI, however, fails to extract a constitution of principles from preference data, because clustering occurs along the topics in the data rather than along the preferences themselves.
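The four-step pipeline summarized above (embed, cluster, generate principles, validate) could be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: toy random vectors stand in for real preference-pair embeddings, and the language-model steps for principle generation and validation are only stubbed in comments.

```python
import numpy as np
from sklearn.cluster import KMeans

# Step 1 (stand-in): embed preference pairs into vectors. In the actual
# pipeline these would come from an embedding model applied to
# chosen/rejected response pairs; here we draw two synthetic groups.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(10, 8)),  # group A
    rng.normal(loc=1.0, scale=0.1, size=(10, 8)),  # group B
])

# Step 2: cluster the embeddings to group related preferences.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)

# Step 3 (stub): a language model would summarize the samples in each
# cluster into a natural-language principle; here we only collect the
# member indices per cluster.
clusters = {c: np.where(kmeans.labels_ == c)[0] for c in range(2)}

# Step 4 (stub): each generated principle would then be validated against
# held-out preference pairs, e.g. by checking whether an LLM judge applying
# the principle reproduces the original preference labels.
```

As the thesis argues, on real data such clusters tend to form around topics rather than around the underlying preferences, which is exactly the failure mode the disentanglement claim refers to.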