Using genderize.io to infer the gender of first names: how to improve the accuracy of the inference

Objective: We recently showed that genderize.io is not a sufficiently powerful gender detection tool due to a large number of nonclassifications. In the present study, we aimed to assess whether the accuracy of inference by genderize.io can be improved by manipulating the first names in the database...

Bibliographic Details
Main Author:	Paul Sebo
Format:	Article
Language:	English
Published:	University Library System, University of Pittsburgh 2021-11-01
Series:	Journal of the Medical Library Association
Subjects:	accuracy; gender determination; genderize.io; misclassification; name; name-to-gender
Online Access:	https://jmla.pitt.edu/ojs/jmla/article/view/1252

_version_	1818938550628909056
author	Paul Sebo
collection	DOAJ
description	Objective: We recently showed that genderize.io is not a sufficiently powerful gender detection tool due to a large number of nonclassifications. In the present study, we aimed to assess whether the accuracy of inference by genderize.io can be improved by manipulating the first names in the database. Methods: We used a database containing the first names, surnames, and gender of 6,131 physicians practicing in a multicultural country (Switzerland). We uploaded the original CSV file (file #1), the file obtained after removing all diacritic marks, such as accents and cedilla (file #2), and the file obtained after removing all diacritic marks and retaining only the first term of the compound first names (file #3). For each file, we computed three performance metrics: proportion of misclassifications (errorCodedWithoutNA), proportion of nonclassifications (naCoded), and proportion of misclassifications and nonclassifications (errorCoded). Results: naCoded, which was high for file #1 (16.4%), was reduced after data manipulation (file #2: 11.7%, file #3: 0.4%). As the increase in the number of misclassifications was small, the overall performance of genderize.io (i.e., errorCoded) improved, especially for file #3 (file #1: 17.7%, file #2: 13.0%, and file #3: 2.3%). Conclusions: A relatively simple manipulation of the data improved the accuracy of gender inference by genderize.io. We recommend using genderize.io only with files that were modified in this way.
first_indexed	2024-12-20T06:09:38Z
format	Article
id	doaj.art-4e0663cfa18448c28fb87b6ae0702cc7
institution	Directory Open Access Journal
issn	1536-5050 1558-9439
language	English
last_indexed	2024-12-20T06:09:38Z
publishDate	2021-11-01
publisher	University Library System, University of Pittsburgh
record_format	Article
series	Journal of the Medical Library Association
spelling	doaj.art-4e0663cfa18448c28fb87b6ae0702cc7; 2022-12-21T19:50:43Z; eng; University Library System, University of Pittsburgh; Journal of the Medical Library Association; 1536-5050; 1558-9439; 2021-11-01; 1094; 10.5195/jmla.2021.1252; 608
title	Using genderize.io to infer the gender of first names: how to improve the accuracy of the inference
topic	accuracy; gender determination; genderize.io; misclassification; name; name-to-gender
url	https://jmla.pitt.edu/ojs/jmla/article/view/1252
work_keys_str_mv	AT paulsebo usinggenderizeiotoinferthegenderoffirstnameshowtoimprovetheaccuracyoftheinference
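
Illustrative sketch (not part of the catalog record or the article): the abstract describes two name manipulations (removing diacritic marks, then keeping only the first term of a compound first name) and three performance metrics. The Python below shows one way these steps might look. All function names are hypothetical. The splitting rule (hyphens and whitespace) and the denominators (classified names only for errorCodedWithoutNA, all names for naCoded and errorCoded) are assumptions, since the abstract does not spell them out. The query to genderize.io itself is not shown; a nonclassification is represented here as None.

```python
import re
import unicodedata


def strip_diacritics(name: str) -> str:
    """Remove combining marks such as accents and cedillas (file #2 step).

    Letters with no canonical decomposition (for example 'ø') are left as-is.
    """
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def first_term(name: str) -> str:
    """Keep only the first term of a compound first name (file #3 step).

    Assumption: terms are separated by hyphens or whitespace.
    """
    return re.split(r"[\s\-]+", name.strip())[0]


def performance_metrics(true_genders, predicted_genders):
    """Compute the three proportions named in the abstract.

    predicted_genders uses None for a nonclassification.
    Assumption: errorCodedWithoutNA is computed over classified names only.
    """
    total = len(true_genders)
    if total == 0 or total != len(predicted_genders):
        raise ValueError("inputs must be non-empty and of equal length")
    unclassified = sum(1 for p in predicted_genders if p is None)
    misclassified = sum(
        1
        for t, p in zip(true_genders, predicted_genders)
        if p is not None and p != t
    )
    classified = total - unclassified
    return {
        "errorCodedWithoutNA": misclassified / classified if classified else 0.0,
        "naCoded": unclassified / total,
        "errorCoded": (misclassified + unclassified) / total,
    }


if __name__ == "__main__":
    for raw in ["François", "Jean-Pierre", "Anne Sophie"]:
        print(raw, "->", first_term(strip_diacritics(raw)))
```

In this sketch the two cleaning steps are kept as separate functions so that either one can be applied alone, mirroring the distinction between file #2 and file #3.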

Similar Items