Misclassification Bias in Computational Social Science: A Simulation Approach for Assessing the Impact of Classification Errors on Social Indicators Research
Main Authors: | Sergey Smetanin, Mikhail Komarov |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2022-01-01 |
Series: | IEEE Access |
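Subjects: | Misclassification bias; social indicators; classification; supervised machine learning; computational social science; sentiment analysis |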
Online Access: | https://ieeexplore.ieee.org/document/9706439/ |
_version_ | 1828890039107452928 |
---|---|
author | Sergey Smetanin; Mikhail Komarov |
author_facet | Sergey Smetanin; Mikhail Komarov |
author_sort | Sergey Smetanin |
collection | DOAJ |
description | A growing body of literature has examined the potential of machine learning algorithms in constructing social indicators based on the automatic classification of digital traces. However, as long as the classification algorithms’ predictions are not completely error-free, the estimate of the relative occurrence of a particular class may be affected by misclassification bias, thereby affecting the value of the calculated social indicator. Although a significant amount of studies have investigated misclassification bias correction techniques, they commonly rely on a set of assumptions that are likely to be violated in practice, which calls into question the effectiveness of these methods. Thus, there is a knowledge gap with respect to the assessment of misclassification bias’s impact on a specific social indicator formula without strict reference to the number of classes. Moreover, given the erroneous nature of automatic classification algorithms, the quality of a predicted indicator can be assessed not only using regression quality metrics, as was done in existing literature, but also using correlation metrics. In this paper, we propose a simulation approach for assessing the impact of misclassification bias on the calculated social indicators in terms of regression and correlation metrics. The proposed approach focuses on indicators calculated based on the distribution of classes and can process any number of classes. The proposed approach allows selecting the most appropriate classification model for a particular social indicator, and vice versa. Moreover, it allows for assessment of the optimistic level of correlation between the indicator calculated based on the results of the classification algorithm and the true underlying indicator. |
first_indexed | 2024-12-13T12:51:37Z |
format | Article |
id | doaj.art-d3fc775f847b43f186dc401a633616de |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-13T12:51:37Z |
publishDate | 2022-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-d3fc775f847b43f186dc401a633616de; 2022-12-21T23:45:19Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2022-01-01; vol. 10, pp. 18886-18898; DOI 10.1109/ACCESS.2022.3149897; document 9706439; Misclassification Bias in Computational Social Science: A Simulation Approach for Assessing the Impact of Classification Errors on Social Indicators Research; Sergey Smetanin (0), https://orcid.org/0000-0001-6373-3410; Mikhail Komarov (1), https://orcid.org/0000-0001-7075-0016; (0) Department of Business Informatics, Graduate School of Business, National Research University Higher School of Economics, Moscow, Russia; (1) Department of Business Informatics, Graduate School of Business, National Research University Higher School of Economics, Moscow, Russia; A growing body of literature has examined the potential of machine learning algorithms in constructing social indicators based on the automatic classification of digital traces. However, as long as the classification algorithms’ predictions are not completely error-free, the estimate of the relative occurrence of a particular class may be affected by misclassification bias, thereby affecting the value of the calculated social indicator. Although a significant amount of studies have investigated misclassification bias correction techniques, they commonly rely on a set of assumptions that are likely to be violated in practice, which calls into question the effectiveness of these methods. Thus, there is a knowledge gap with respect to the assessment of misclassification bias’s impact on a specific social indicator formula without strict reference to the number of classes. Moreover, given the erroneous nature of automatic classification algorithms, the quality of a predicted indicator can be assessed not only using regression quality metrics, as was done in existing literature, but also using correlation metrics. In this paper, we propose a simulation approach for assessing the impact of misclassification bias on the calculated social indicators in terms of regression and correlation metrics. The proposed approach focuses on indicators calculated based on the distribution of classes and can process any number of classes. The proposed approach allows selecting the most appropriate classification model for a particular social indicator, and vice versa. Moreover, it allows for assessment of the optimistic level of correlation between the indicator calculated based on the results of the classification algorithm and the true underlying indicator.; https://ieeexplore.ieee.org/document/9706439/; Misclassification bias; social indicators; classification; supervised machine learning; computational social science; sentiment analysis |
spellingShingle | Sergey Smetanin; Mikhail Komarov; Misclassification Bias in Computational Social Science: A Simulation Approach for Assessing the Impact of Classification Errors on Social Indicators Research; IEEE Access; Misclassification bias; social indicators; classification; supervised machine learning; computational social science; sentiment analysis |
title | Misclassification Bias in Computational Social Science: A Simulation Approach for Assessing the Impact of Classification Errors on Social Indicators Research |
topic | Misclassification bias; social indicators; classification; supervised machine learning; computational social science; sentiment analysis |
url | https://ieeexplore.ieee.org/document/9706439/ |
work_keys_str_mv | AT sergeysmetanin misclassificationbiasincomputationalsocialscienceasimulationapproachforassessingtheimpactofclassificationerrorsonsocialindicatorsresearch AT mikhailkomarov misclassificationbiasincomputationalsocialscienceasimulationapproachforassessingtheimpactofclassificationerrorsonsocialindicatorsresearch |
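
The description above outlines a simulation that compares an indicator computed from classifier output with the true underlying indicator, using regression and correlation metrics. The following is a minimal sketch of that general idea, not the authors' published procedure. The function name `simulate_indicator_error`, the Dirichlet sampling of class shares, the example confusion matrix, and the `net_sentiment` indicator formula are all assumptions made for illustration.

```python
import numpy as np


def simulate_indicator_error(confusion, indicator, n_periods=200,
                             n_items=1000, seed=0):
    """Estimate how misclassification distorts a class-share indicator.

    confusion[i][j] is the assumed probability that an item of true
    class i is predicted as class j (rows are normalized to sum to 1).
    indicator maps a vector of class shares to a single number.
    Returns (mean absolute error, Pearson correlation) between the
    true and the predicted indicator across simulated periods.
    """
    rng = np.random.default_rng(seed)
    confusion = np.asarray(confusion, dtype=float)
    confusion = confusion / confusion.sum(axis=1, keepdims=True)
    n_classes = confusion.shape[0]

    true_vals, pred_vals = [], []
    for _ in range(n_periods):
        # Assumption: true class shares vary freely between periods.
        shares = rng.dirichlet(np.ones(n_classes))
        true_counts = rng.multinomial(n_items, shares)

        # Redistribute each true class over predicted classes.
        pred_counts = np.zeros(n_classes, dtype=np.int64)
        for k in range(n_classes):
            pred_counts += rng.multinomial(true_counts[k], confusion[k])

        true_vals.append(indicator(true_counts / n_items))
        pred_vals.append(indicator(pred_counts / n_items))

    true_vals = np.array(true_vals)
    pred_vals = np.array(pred_vals)
    mae = float(np.mean(np.abs(true_vals - pred_vals)))
    corr = float(np.corrcoef(true_vals, pred_vals)[0, 1])
    return mae, corr


def net_sentiment(shares):
    # Hypothetical indicator: share of class 2 (positive)
    # minus share of class 0 (negative).
    return shares[2] - shares[0]


if __name__ == "__main__":
    # Hypothetical three-class classifier with 80% per-class recall.
    noisy = [[0.8, 0.1, 0.1],
             [0.1, 0.8, 0.1],
             [0.1, 0.1, 0.8]]
    mae, corr = simulate_indicator_error(noisy, net_sentiment)
    print(f"MAE={mae:.3f}, Pearson r={corr:.3f}")
```

Running the same function with several candidate confusion matrices, or several indicator formulas, gives a rough way to compare classifier and indicator pairings under these simplifying assumptions. Because the confusion matrix is held fixed across periods, the resulting correlation is best read as an optimistic estimate.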