What’s in a Name? Experimental Evidence of Gender Bias in Recommendation Letters Generated by ChatGPT

Background: Artificial intelligence chatbots such as ChatGPT (OpenAI) have garnered excitement about their potential for delegating writing tasks ordinarily performed by humans. Many of these tasks (eg, writing recommendation letters) have social and professional ramifications, making the potential social biases in ChatGPT’s underlying language model a serious concern.

Objective: Three preregistered studies used the text analysis program Linguistic Inquiry and Word Count (LIWC) to investigate gender bias in recommendation letters written by ChatGPT in human-use sessions (N=1400 total letters).

Methods: We conducted analyses using 22 existing LIWC dictionaries, as well as 6 newly created dictionaries based on systematic reviews of gender bias in recommendation letters, to compare recommendation letters generated for the 200 most historically popular “male” and “female” names in the United States. Study 1 used 3 different letter-writing prompts intended to accentuate professional accomplishments associated with male stereotypes, female stereotypes, or neither. Study 2 examined whether lengthening each of the 3 prompts while holding the between-prompt word count constant modified the extent of bias. Study 3 examined the variability within letters generated for the same name and prompts. We hypothesized that when prompted with gender-stereotyped professional accomplishments, ChatGPT would evidence gender-based language differences replicating those found in systematic reviews of human-written recommendation letters (eg, more affiliative, social, and communal language for female names; more agentic and skill-based language for male names).

Results: Significant differences in language between letters generated for female versus male names were observed across all prompts, including the prompt hypothesized to be neutral, and across nearly all language categories tested. Historically female names received significantly more social referents (5/6, 83% of prompts), communal or doubt-raising language (4/6, 67% of prompts), personal pronouns (4/6, 67% of prompts), and clout language (5/6, 83% of prompts). Contradicting the study hypotheses, some gender differences (eg, achievement language and agentic language) were significant in both the hypothesized and nonhypothesized directions, depending on the prompt. Heteroscedasticity between male and female names was observed in multiple linguistic categories, with greater variance for historically female names than for historically male names.

Conclusions: ChatGPT reproduces many gender-based language biases that have been reliably identified in investigations of human-written reference letters, although these differences vary across prompts and language categories. Caution should be taken when using ChatGPT for tasks that have social consequences, such as reference letter writing. The methods developed in this study may be useful for ongoing bias testing among progressive generations of chatbots across a range of real-world scenarios.

Trial Registration: OSF Registries osf.io/ztv96; https://osf.io/ztv96

Bibliographic Details
Main Authors: Deanna M Kaplan, Roman Palitsky, Santiago J Arconada Alvarez, Nicole S Pozzo, Morgan N Greenleaf, Ciara A Atkinson, Wilbur A Lam
Format: Article
Language: English
Published: JMIR Publications, 2024-03-01
Series: Journal of Medical Internet Research
Online Access: https://www.jmir.org/2024/1/e51837
ISSN: 1438-8871
DOI: 10.2196/51837
Author ORCIDs: Deanna M Kaplan, https://orcid.org/0000-0002-9300-3029; Roman Palitsky, https://orcid.org/0000-0002-0415-6411; Santiago J Arconada Alvarez, https://orcid.org/0000-0003-0737-6679; Nicole S Pozzo, https://orcid.org/0009-0003-7325-6373; Morgan N Greenleaf, https://orcid.org/0000-0003-1569-5696; Ciara A Atkinson, https://orcid.org/0000-0003-0835-7883; Wilbur A Lam, https://orcid.org/0000-0002-0325-7990
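
The Methods summarized in the abstract above rest on dictionary-based word counting: letters generated for historically female and historically male names are scored against LIWC categories, and group means and variances are compared. As a rough, non-authoritative illustration of that kind of comparison, the Python sketch below scores a few placeholder letters against two tiny invented word lists. The word lists, example letters, and function names here are assumptions made for demonstration only; they are not the study’s LIWC dictionaries, prompts, or data.

```python
# Minimal sketch (not the authors' pipeline): the study analyzed 1400 ChatGPT-generated
# letters with the proprietary LIWC program and 28 dictionaries (22 existing, 6 newly
# created). The small word lists and example letters below are invented placeholders
# used only to illustrate the dictionary-rate comparison described in the Methods.
import re
from statistics import mean, pvariance

# Hypothetical stand-ins for LIWC-style categories (not the actual LIWC dictionaries).
COMMUNAL = {"warm", "caring", "supportive", "helpful", "kind", "team"}
AGENTIC = {"assertive", "decisive", "ambitious", "independent", "confident", "leader"}

def category_rate(text: str, dictionary: set) -> float:
    """Percentage of words in `text` that fall in `dictionary`, in the spirit of
    LIWC-style percent-of-total-words scores."""
    words = re.findall(r"[a-z']+", text.lower())
    return 100.0 * sum(w in dictionary for w in words) / len(words) if words else 0.0

# Placeholder letters; in the study, each letter was generated by ChatGPT from a prompt
# containing one of the 200 most historically popular female or male US names.
letters = {
    "female names": [
        "Maria is a warm, caring, and supportive member of our team.",
        "Emily is kind and helpful, and colleagues value her supportive nature.",
    ],
    "male names": [
        "James is an assertive, decisive, and ambitious leader.",
        "Michael is independent and confident, with a strong record as a leader.",
    ],
}

for group, texts in letters.items():
    communal = [category_rate(t, COMMUNAL) for t in texts]
    agentic = [category_rate(t, AGENTIC) for t in texts]
    # Group means speak to mean differences; variances speak to the heteroscedasticity
    # (greater variance for historically female names) reported in the Results.
    print(
        f"{group}: communal mean={mean(communal):.1f}% (var={pvariance(communal):.2f}), "
        f"agentic mean={mean(agentic):.1f}% (var={pvariance(agentic):.2f})"
    )
```

In the studies themselves, the analogous comparison was run at far larger scale: 1400 ChatGPT-generated letters, 28 dictionaries, and 6 prompt conditions, with preregistered hypotheses and formal statistical tests.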