Diminished diversity-of-thought in a standard large language model
We test whether large language models (LLMs) can be used to simulate human participants in social-science studies. To do this, we ran replications of 14 studies from the Many Labs 2 replication project with OpenAI’s text-davinci-003 model, colloquially known as GPT-3.5. Based on our pre-registered analyses, we find that among the eight studies we could analyse, our GPT sample replicated 37.5% of the original results and 37.5% of the Many Labs 2 results. However, we were unable to analyse the remaining six studies due to an unexpected phenomenon we call the “correct answer” effect. Different runs of GPT-3.5 answered nuanced questions probing political orientation, economic preference, judgement, and moral philosophy with zero or near-zero variation in responses: with the supposedly “correct answer.” In one exploratory follow-up study, we found that a “correct answer” was robust to changing the demographic details that precede the prompt. In another, we found that most but not all “correct answers” were robust to changing the order of answer choices. One of our most striking findings occurred in our replication of the Moral Foundations Theory survey results, where we found GPT-3.5 identifying as a political conservative in 99.6% of the cases, and as a liberal in 99.3% of the cases in the reverse-order condition. However, both self-reported ‘GPT conservatives’ and ‘GPT liberals’ showed right-leaning moral foundations. Our results cast doubts on the validity of using LLMs as a general replacement for human participants in the social sciences. Our results also raise concerns that a hypothetical AI-led future may be subject to a diminished diversity of thought.
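The abstract describes sampling many independent runs of text-davinci-003 on nuanced survey items and finding near-zero variation in the answers. The sketch below is illustrative only and is not the authors' protocol: it assumes the legacy OpenAI Completions interface (openai<1.0) that text-davinci-003 required (the model has since been retired by OpenAI), and the survey item, answer options, and run count are invented stand-ins for the Many Labs 2 materials.

```python
import collections
import os

import openai  # legacy SDK (openai<1.0), which exposed the Completions endpoint

openai.api_key = os.environ["OPENAI_API_KEY"]

# Hypothetical survey item; the paper's actual items came from Many Labs 2.
ITEM = (
    "Generally speaking, do you consider yourself politically:\n"
    "A) Conservative\n"
    "B) Liberal\n"
    "Answer with a single letter.\n"
    "Answer:"
)

def sample_responses(prompt: str, n_runs: int = 100) -> collections.Counter:
    """Sample the model n_runs times and tally the one-token answers."""
    counts = collections.Counter()
    for _ in range(n_runs):
        resp = openai.Completion.create(
            model="text-davinci-003",  # the model studied; since retired
            prompt=prompt,
            max_tokens=1,
            temperature=1.0,  # sampling left on, so variation is possible in principle
        )
        counts[resp["choices"][0]["text"].strip()] += 1
    return counts

if __name__ == "__main__":
    # A heavily skewed tally (e.g. Counter({'A': 100})) across independent runs
    # is the kind of near-zero variation the paper calls a "correct answer" effect.
    print(sample_responses(ITEM))
    # Swapping the option order mirrors the paper's reverse-order condition.
    print(sample_responses(ITEM.replace("A) Conservative\nB) Liberal",
                                        "A) Liberal\nB) Conservative")))
```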
Main Authors: Park, Peter S.; Schoenegger, Philipp; Zhu, Chongyang
Other Authors: Massachusetts Institute of Technology. Department of Physics
Format: Article
Language: English
Published: Springer US, 2024
Online Access: https://hdl.handle.net/1721.1/153319
Collection: MIT
Record ID: mit-1721.1/153319
Institution: Massachusetts Institute of Technology
Published Online: 2024-01-09
Citation: Park, P.S., Schoenegger, P., & Zhu, C. (2024). Diminished diversity-of-thought in a standard large language model. Behavior Research Methods.
DOI: https://doi.org/10.3758/s13428-023-02307-x
License: Creative Commons Attribution (https://creativecommons.org/licenses/by/4.0/); rights held by The Author(s)