Summary: | This study presents a comprehensive evaluation of biases in prominent autoregressive
language models, including GPT-2, Llama-7B, and Mistral-7B. The research systematically
assesses these models across multiple dimensions of bias: toxicity and biases relating to
gender, race, religion, and LGBTQIA+ identities.
To evaluate toxicity, the study employs the RealToxicityPrompts dataset and the
"roberta-hate-speech-dynabench-r4" model. Gender, racial, and religious biases are examined
using the BOLD dataset and the REGARD metric, while the HONEST benchmark is used to
assess biases against LGBTQIA+ identities.
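As a rough illustration (a minimal sketch under assumed tooling, not the study's actual code), both the toxicity classifier and the REGARD metric named above are exposed by the Hugging Face `evaluate` library; the completions below are invented placeholders rather than outputs from the evaluated models.

```python
import evaluate

# Hypothetical completions standing in for model outputs generated from
# RealToxicityPrompts / BOLD prompts (placeholders, not data from the study).
completions = [
    "The new neighbours were friendly and helped us settle in.",
    "People from that part of town are always causing trouble.",
]

# Toxicity measurement; by default this loads the
# facebook/roberta-hate-speech-dynabench-r4-target classifier.
toxicity = evaluate.load("toxicity", module_type="measurement")
tox = toxicity.compute(predictions=completions)
print(tox["toxicity"])  # one toxicity probability per completion

# REGARD measurement for the polarity of language about a demographic group.
regard = evaluate.load("regard", module_type="measurement")
reg = regard.compute(data=completions)
print(reg["regard"])  # per-completion positive/negative/neutral/other scores
```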
Notably, the research explores the effectiveness of structured prompts, particularly zero-shot
Chain-of-Thought (CoT)-based implication prompting, as a debiasing technique. The results
demonstrate the potential of this approach to mitigate biases across various domains, with
Llama-7B exhibiting the most consistent and substantial improvements. However, the study
also highlights the difficulty of mitigating biases against LGBTQIA+ identities, underscoring
the need for more targeted and specialised techniques in this area.
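The study's exact prompt wording is not reproduced here; the following is a hedged sketch of what a zero-shot CoT implication prompt wrapper might look like, where `build_implication_prompt` is a hypothetical helper introduced purely for illustration.

```python
def build_implication_prompt(original_prompt: str) -> str:
    """Wrap a generation prompt with a zero-shot chain-of-thought instruction
    asking the model to reason about the social implications of its answer
    before continuing the text (illustrative wording, not the study's)."""
    return (
        "Before continuing the text, think step by step about the implications "
        "of the continuation: could it stereotype, demean, or exclude any group? "
        "Then write a continuation that avoids those harms.\n\n"
        f"Text to continue: {original_prompt}\n"
        "Reasoning and continuation:"
    )

# Example: wrapping a BOLD-style prompt before passing it to the model.
print(build_implication_prompt("The women in the meeting were"))
```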
Overall, this work provides a comprehensive understanding of the biases present in
contemporary autoregressive language models and offers insights into effective strategies for
bias mitigation, paving the way for the development of more equitable and inclusive AI
systems.