Computational thematics: comparing algorithms for clustering the genres of literary fiction

Abstract What are the best methods of capturing thematic similarity between literary texts? Knowing the answer to this question would be useful for automatic clustering of book genres, or any other thematic grouping. This paper compares a variety of algorithms for unsupervised learning of thematic s...


Bibliographic Details
Main Authors: Oleg Sobchuk, Artjoms Šeļa
Format: Article
Language: English
Published: Springer Nature 2024-03-01
Series: Humanities & Social Sciences Communications
Online Access: https://doi.org/10.1057/s41599-024-02933-6
collection DOAJ
description Abstract What are the best methods of capturing thematic similarity between literary texts? Knowing the answer to this question would be useful for automatic clustering of book genres, or any other thematic grouping. This paper compares a variety of algorithms for unsupervised learning of thematic similarities between texts, which we call “computational thematics”. These algorithms belong to three steps of analysis: text pre-processing, extraction of text features, and measuring distances between the lists of features. Each of these steps includes a variety of options. We test all the possible combinations of these options. Every combination of algorithms is given a task to cluster a corpus of books belonging to four pre-tagged genres of fiction. This clustering is then validated against the “ground truth” genre labels. Such comparison of algorithms allows us to learn the best and the worst combinations for computational thematic analysis. To illustrate the difference between the best and the worst methods, we then cluster 5000 random novels from the HathiTrust corpus of fiction.
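The three steps named in the description (pre-processing, feature extraction, distance measurement), followed by clustering and validation against genre labels, can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the record does not say which options were compared, so lowercased word tokens, relative word frequencies, cosine distance, average-linkage clustering, and the Rand index are all stand-in choices, and the four-sentence corpus with its two genre tags is invented for the example.

```python
import math
import re
from collections import Counter
from itertools import combinations


def preprocess(text):
    """Step 1 (assumed option): lowercase and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def extract_features(tokens):
    """Step 2 (assumed option): relative word frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def cosine_distance(a, b):
    """Step 3 (assumed option): cosine distance between two feature dicts."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def cluster(features, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(features))]

    def linkage(c1, c2):
        total = sum(cosine_distance(features[i], features[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the pair of clusters with the smallest average distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: combinations guarantees x < y

    labels = [0] * len(features)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def rand_index(predicted, truth):
    """Share of text pairs on which the clustering and the genre tags agree."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        (predicted[i] == predicted[j]) == (truth[i] == truth[j]) for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical toy corpus: two invented "genres", two texts each.
texts = [
    "The detective examined the murder weapon and questioned the suspect.",
    "A murder suspect lied to the detective about the weapon.",
    "The starship crew landed on the alien planet.",
    "An alien starship orbited the distant planet.",
]
genres = ["detective", "detective", "scifi", "scifi"]

features = [extract_features(preprocess(text)) for text in texts]
labels = cluster(features, k=2)
print(labels, rand_index(labels, genres))
```

Comparing combinations of options, as the description outlines, would amount to swapping out `preprocess`, `extract_features`, or `cosine_distance` and re-scoring each variant against the same genre tags.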
first_indexed 2024-04-24T19:58:10Z
format Article
id doaj.art-6d120103d24749de8bf757df44c87dba
institution Directory Open Access Journal
issn 2662-9992
language English
last_indexed 2024-04-24T19:58:10Z
publishDate 2024-03-01
publisher Springer Nature
record_format Article
series Humanities & Social Sciences Communications
spelling doaj.art-6d120103d24749de8bf757df44c87dba | 2024-03-24T12:13:37Z | eng | Springer Nature | Humanities & Social Sciences Communications | 2662-9992 | 2024-03-01 | 111112 | 10.1057/s41599-024-02933-6 | Computational thematics: comparing algorithms for clustering the genres of literary fiction | Oleg Sobchuk (0), Artjoms Šeļa (1) | (0) Department of Human Behavior, Ecology and Culture, Max Planck Institute for Evolutionary Anthropology; (1) Institute of Polish Language, Polish Academy of Sciences | https://doi.org/10.1057/s41599-024-02933-6