Vocational Domain Identification with Machine Learning and Natural Language Processing on Wikipedia Text: Error Analysis and Class Balancing

Highly-skilled migrants and refugees finding employment in low-skill vocations, despite professional qualifications and educational backgrounds, has become a global tendency, mainly due to the language barrier. Employment prospects for displaced communities are mostly decided by their knowledge of t...

Bibliographic Details
Main Authors:	Maria Nefeli Nikiforos, Konstantina Deliveri, Katia Lida Kermanidis, Adamantia Pateli
Format:	Article
Language:	English
Published:	MDPI AG 2023-05-01
Series:	Computers
Subjects:	natural language processing; social text mining; machine learning; vocational domain identification; vocational language; error analysis
Online Access:	https://www.mdpi.com/2073-431X/12/6/111

author	Maria Nefeli Nikiforos; Konstantina Deliveri; Katia Lida Kermanidis; Adamantia Pateli
collection	DOAJ
description	Highly skilled migrants and refugees often find employment in low-skill vocations despite their professional qualifications and educational backgrounds, a global tendency driven mainly by the language barrier. Employment prospects for displaced communities are mostly decided by their knowledge of the sublanguage of the vocational domain in which they want to work. Common vocational domains include agriculture, cooking, crafting, construction, and hospitality. The increasing amount of user-generated content in wikis and social networks provides a valuable source of data for data mining, natural language processing, and machine learning applications. This paper extends the authors’ previous research on automatic vocational domain identification by further analyzing the results of machine learning experiments with a domain-specific textual data set along two research directions: (a) prediction analysis and (b) data balancing. Analysis of wrong predictions and the features that contributed to misclassification, together with analysis of correct predictions and their most dominant features, led to the identification of a primary set of terms for the vocational domains. Data balancing techniques were applied to the data set to observe their impact on the performance of the classification model. The paper proposes a novel four-step methodology consisting of successive applications of SMOTE oversampling on imbalanced data. Data oversampling obtained better results than data undersampling on imbalanced data sets, while hybrid approaches performed reasonably well.
first_indexed	2024-03-11T02:37:00Z
format	Article
id	doaj.art-94b3e51e189d4dab9c75d4f946804f6a
institution	Directory Open Access Journal
issn	2073-431X
language	English
last_indexed	2024-03-11T02:37:00Z
publishDate	2023-05-01
publisher	MDPI AG
record_format	Article
series	Computers
spelling	doaj.art-94b3e51e189d4dab9c75d4f946804f6a; 2023-11-18T09:54:12Z; eng; MDPI AG; Computers; 2073-431X; 2023-05-01; 12(6), 111; 10.3390/computers12060111; Maria Nefeli Nikiforos, Konstantina Deliveri, Katia Lida Kermanidis, Adamantia Pateli (all: Department of Informatics, Ionian University, 49132 Corfu, Greece); https://www.mdpi.com/2073-431X/12/6/111
title	Vocational Domain Identification with Machine Learning and Natural Language Processing on Wikipedia Text: Error Analysis and Class Balancing
topic	natural language processing; social text mining; machine learning; vocational domain identification; vocational language; error analysis
url	https://www.mdpi.com/2073-431X/12/6/111

Similar Items