Multi-Keyword Classification: A Case Study in Finnish Social Sciences Data Archive

Bibliographic Details
Main Authors:	Erjon Skenderi, Jukka Huhtamäki, Kostas Stefanidis
Format:	Article
Language:	English
Published:	MDPI AG 2021-11-01
Series:	Information
Subjects:	multi-label classification; supervised learning; text representation; text feature extraction
Online Access:	https://www.mdpi.com/2078-2489/12/12/491

Description:	In this paper, we consider the task of assigning relevant labels to studies in the social science domain. Manual labelling is an expensive process and prone to human error. Various multi-label text classification machine learning approaches have been proposed to resolve this problem. We introduce a dataset obtained from the Finnish Social Science Archive, comprising the metadata of 2968 research studies. The metadata of each study includes attributes such as the “abstract” and the “set of labels”. We used the Bag of Words (BoW), TF-IDF term weighting and pretrained word embeddings obtained from FastText and BERT models to generate the text representations for each study’s abstract field. Our selection of multi-label classification methods includes a Naive approach, Multi-label k Nearest Neighbours (ML-kNN), Multi-Label Random Forest (ML-RF), X-BERT and Parabel. The methods were combined with the text representation techniques and their performance was evaluated on our dataset. We measured the classification accuracy of the combinations using Precision, Recall and F1 metrics. In addition, we used the Normalized Discounted Cumulative Gain to measure the label ranking performance of the selected methods combined with the text representation techniques. The results showed that the ML-RF model achieved a higher classification accuracy with the TF-IDF features and, based on the ranking score, the Parabel model outperformed the other methods.
Affiliations:	Erjon Skenderi (Faculty of Management and Business, Tampere University, 33100 Tampere, Finland); Jukka Huhtamäki (Faculty of Management and Business, Tampere University, 33100 Tampere, Finland); Kostas Stefanidis (Faculty of Information Technology and Communication Sciences, Tampere University, 33100 Tampere, Finland)
Volume/Issue:	Vol. 12, No. 12, Article 491
DOI:	10.3390/info12120491
ISSN:	2078-2489
Collection:	DOAJ (Directory of Open Access Journals)
Record ID:	doaj.art-87297a58e4ed40d7bad2563140cf96d8
