Fine-grained classification of social science journal articles using textual data: A comparison of supervised machine learning approaches

Abstract: We compare two supervised machine learning algorithms—Multinomial Naïve Bayes and Gradient Boosting—to classify social science articles using textual data. The high level of granularity of the classification scheme used and the possibility that multiple categories are assigne...
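
The record's full description mentions a multilabel classifier chaining model with Multinomial Naïve Bayes as one of the base learners. The sketch below is only a rough illustration of that general idea, not the authors' implementation: it is written in pure Python, the class names (`BinaryMultinomialNB`, `ClassifierChain`), the `__label__` feature tokens, the two label names, and the six toy texts are all invented for this example, and Gradient Boosting is not shown. Each link in the chain is a binary classifier that also sees the labels decided earlier in the chain, so a document can end up with any number of categories.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real text preprocessing."""
    return text.lower().split()


class BinaryMultinomialNB:
    """Multinomial Naive Bayes for one yes/no label, with add-one smoothing."""

    def fit(self, docs, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.n_docs = {0: 0, 1: 0}
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
            self.n_docs[y] += 1
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def predict(self, tokens):
        total = self.n_docs[0] + self.n_docs[1]
        scores = {}
        for y in (0, 1):
            # Smoothed log prior plus smoothed log likelihood of each known token.
            score = math.log((self.n_docs[y] + 1) / (total + 2))
            denom = sum(self.counts[y].values()) + len(self.vocab)
            for t in tokens:
                if t in self.vocab:
                    score += math.log((self.counts[y][t] + 1) / denom)
            scores[y] = score
        return 1 if scores[1] > scores[0] else 0


class ClassifierChain:
    """One binary model per label; later models see earlier labels as features."""

    def __init__(self, label_names):
        self.label_names = list(label_names)

    def fit(self, texts, label_sets):
        docs = [tokenize(t) for t in texts]
        self.models = []
        for i, name in enumerate(self.label_names):
            earlier = self.label_names[:i]
            # During training, append the TRUE earlier labels as extra tokens.
            augmented = [
                d + [f"__label__{p}" for p in earlier if p in labels]
                for d, labels in zip(docs, label_sets)
            ]
            y = [1 if name in labels else 0 for labels in label_sets]
            self.models.append(BinaryMultinomialNB().fit(augmented, y))
        return self

    def predict(self, text):
        tokens = tokenize(text)
        assigned = []
        for name, model in zip(self.label_names, self.models):
            # At prediction time, append the labels PREDICTED so far.
            features = tokens + [f"__label__{p}" for p in assigned]
            if model.predict(features):
                assigned.append(name)
        return assigned


# Toy, invented training data: each text may carry one or more labels.
texts = [
    "inflation monetary policy central bank",
    "labor market wages unemployment",
    "social inequality class mobility",
    "family networks community norms",
    "trade tariffs exchange rates",
    "wages inequality labor unions",
]
label_sets = [
    {"economics"},
    {"economics", "sociology"},
    {"sociology"},
    {"sociology"},
    {"economics"},
    {"economics", "sociology"},
]

chain = ClassifierChain(["economics", "sociology"]).fit(texts, label_sets)
print(chain.predict("wages inequality labor"))
```

The order of labels in the chain is a design choice: a different order changes which earlier decisions each later classifier can condition on, and can change the predictions.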


Bibliographic Details
Main Authors: Joshua Eykens, Raf Guns, Tim C. E. Engels
Format: Article
Language: English
Published: The MIT Press 2021-01-01
Series: Quantitative Science Studies
Online Access: https://direct.mit.edu/qss/article/2/1/89/97077/Fine-grained-classification-of-social-science
_version_ 1818338450135318528
author Joshua Eykens
Raf Guns
Tim C. E. Engels
collection DOAJ
description Abstract: We compare two supervised machine learning algorithms—Multinomial Naïve Bayes and Gradient Boosting—to classify social science articles using textual data. The high level of granularity of the classification scheme used and the possibility that multiple categories are assigned to a document make this task challenging. To collect the training data, we query three discipline-specific thesauri to retrieve articles corresponding to specialties in the classification. The resulting data set consists of 113,909 records and covers 245 specialties, aggregated into 31 subdisciplines from three disciplines. Experts were consulted to validate the thesauri-based classification. The resulting multilabel data set is used to train the machine learning algorithms in different configurations. We deploy a multilabel classifier chaining model, allowing for an arbitrary number of categories to be assigned to each document. The best results are obtained with Gradient Boosting. The approach does not rely on citation data. It can be applied in settings where such information is not available. We conclude that fine-grained text-based classification of social sciences publications at a subdisciplinary level is a hard task, for humans and machines alike. A combination of human expertise and machine learning is suggested as a way forward to improve the classification of social sciences documents.
first_indexed 2024-12-13T15:11:18Z
format Article
id doaj.art-d79ff8efdac14506a864572bba5b6705
institution Directory Open Access Journal
issn 2641-3337
language English
last_indexed 2024-12-13T15:11:18Z
publishDate 2021-01-01
publisher The MIT Press
record_format Article
series Quantitative Science Studies
spelling doaj.art-d79ff8efdac14506a864572bba5b6705; 2022-12-21T23:40:51Z; eng; The MIT Press; Quantitative Science Studies; 2641-3337; 2021-01-01; 2; 1; 89; 110; 10.1162/qss_a_00106; Fine-grained classification of social science journal articles using textual data: A comparison of supervised machine learning approaches; Joshua Eykens 0 http://orcid.org/0000-0002-1680-0112; Raf Guns 1 http://orcid.org/0000-0003-3129-0330; Tim C. E. Engels 2 http://orcid.org/0000-0002-4869-7949; Centre for R&D Monitoring (ECOOM), Faculty of Social Sciences, University of Antwerp, Middelheimlaan 1, 2020 Antwerp, Belgium; Centre for R&D Monitoring (ECOOM), Faculty of Social Sciences, University of Antwerp, Middelheimlaan 1, 2020 Antwerp, Belgium; Centre for R&D Monitoring (ECOOM), Faculty of Social Sciences, University of Antwerp, Middelheimlaan 1, 2020 Antwerp, Belgium; Abstract: We compare two supervised machine learning algorithms—Multinomial Naïve Bayes and Gradient Boosting—to classify social science articles using textual data. The high level of granularity of the classification scheme used and the possibility that multiple categories are assigned to a document make this task challenging. To collect the training data, we query three discipline-specific thesauri to retrieve articles corresponding to specialties in the classification. The resulting data set consists of 113,909 records and covers 245 specialties, aggregated into 31 subdisciplines from three disciplines. Experts were consulted to validate the thesauri-based classification. The resulting multilabel data set is used to train the machine learning algorithms in different configurations. We deploy a multilabel classifier chaining model, allowing for an arbitrary number of categories to be assigned to each document. The best results are obtained with Gradient Boosting. The approach does not rely on citation data. It can be applied in settings where such information is not available. We conclude that fine-grained text-based classification of social sciences publications at a subdisciplinary level is a hard task, for humans and machines alike. A combination of human expertise and machine learning is suggested as a way forward to improve the classification of social sciences documents.; https://direct.mit.edu/qss/article/2/1/89/97077/Fine-grained-classification-of-social-science
title Fine-grained classification of social science journal articles using textual data: A comparison of supervised machine learning approaches
url https://direct.mit.edu/qss/article/2/1/89/97077/Fine-grained-classification-of-social-science