Misclassification Bias in Computational Social Science: A Simulation Approach for Assessing the Impact of Classification Errors on Social Indicators Research

A growing body of literature has examined the potential of machine learning algorithms in constructing social indicators based on the automatic classification of digital traces. However, as long as the classification algorithms’ predictions are not completely error-free, the estimate of t...

Bibliographic Details
Main Authors: Sergey Smetanin, Mikhail Komarov
Format: Article
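Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Subjects: Misclassification bias; social indicators; classification; supervised machine learning; computational social science; sentiment analysis
Online Access: https://ieeexplore.ieee.org/document/9706439/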
collection DOAJ
description A growing body of literature has examined the potential of machine learning algorithms in constructing social indicators based on the automatic classification of digital traces. However, as long as the classification algorithms’ predictions are not completely error-free, the estimate of the relative occurrence of a particular class may be affected by misclassification bias, thereby affecting the value of the calculated social indicator. Although a significant number of studies have investigated misclassification bias correction techniques, these techniques commonly rely on a set of assumptions that are likely to be violated in practice, which calls their effectiveness into question. Thus, there is a knowledge gap with respect to assessing the impact of misclassification bias on a specific social indicator formula without strict reference to the number of classes. Moreover, given the error-prone nature of automatic classification algorithms, the quality of a predicted indicator can be assessed not only with regression quality metrics, as was done in the existing literature, but also with correlation metrics. In this paper, we propose a simulation approach for assessing the impact of misclassification bias on the calculated social indicators in terms of regression and correlation metrics. The proposed approach focuses on indicators calculated from the distribution of classes and can process any number of classes. It makes it possible to select the most appropriate classification model for a particular social indicator, and vice versa. Moreover, it allows for assessment of the optimistic level of correlation between the indicator calculated from the results of the classification algorithm and the true underlying indicator.
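The kind of simulation the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `simulate_indicator_error`, the Dirichlet draw of true class shares, the three-class confusion matrix, and the net-sentiment indicator are all assumptions chosen for illustration. The sketch draws a true class distribution per period, passes the items through a classifier described by a row-normalised confusion matrix, and compares the indicator computed from predicted labels with the true one using a regression metric (MAE) and a correlation metric (Pearson's r).

```python
import numpy as np


def simulate_indicator_error(confusion, indicator, n_periods=200, n_items=1000, seed=0):
    """Compare an indicator computed from noisy predictions with its true value.

    confusion[i, j] is the assumed probability that an item of true class i
    is predicted as class j (each row must sum to 1). `indicator` maps a
    vector of class shares to a single number. Works for any number of classes.
    """
    rng = np.random.default_rng(seed)
    confusion = np.asarray(confusion, dtype=float)
    k = confusion.shape[0]
    true_vals = np.empty(n_periods)
    pred_vals = np.empty(n_periods)
    for t in range(n_periods):
        # Hypothetical data-generating step: random true class shares per period.
        p_true = rng.dirichlet(np.ones(k))
        true_counts = rng.multinomial(n_items, p_true)
        # Route the items of each true class through its confusion-matrix row.
        pred_counts = np.zeros(k, dtype=np.int64)
        for i in range(k):
            pred_counts += rng.multinomial(true_counts[i], confusion[i])
        true_vals[t] = indicator(true_counts / n_items)
        pred_vals[t] = indicator(pred_counts / n_items)
    mae = float(np.mean(np.abs(pred_vals - true_vals)))
    pearson = float(np.corrcoef(true_vals, pred_vals)[0, 1])
    return mae, pearson


def net_sentiment(shares):
    # Hypothetical indicator: share of positive minus share of negative items,
    # with classes ordered (positive, neutral, negative).
    return shares[0] - shares[2]


# Assumed confusion matrix of an imperfect three-class sentiment classifier.
noisy = [[0.80, 0.15, 0.05],
         [0.10, 0.80, 0.10],
         [0.05, 0.15, 0.80]]

mae, pearson = simulate_indicator_error(noisy, net_sentiment)
print(f"MAE = {mae:.3f}, Pearson r = {pearson:.3f}")
```

Running the same function with the confusion matrices of several candidate classifiers, or with several indicator formulas, gives a rough way to compare them under these assumptions. Because this symmetric error pattern mostly rescales the indicator, the correlation can stay high even when the absolute error is noticeable.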
first_indexed 2024-12-13T12:51:37Z
format Article
id doaj.art-d3fc775f847b43f186dc401a633616de
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-13T12:51:37Z
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-d3fc775f847b43f186dc401a633616de (2022-12-21T23:45:19Z)
Language: eng
Publisher: IEEE
Journal: IEEE Access, ISSN 2169-3536
Published: 2022-01-01
Volume 10, pages 18886-18898
DOI: 10.1109/ACCESS.2022.3149897
IEEE Xplore document: 9706439
Title: Misclassification Bias in Computational Social Science: A Simulation Approach for Assessing the Impact of Classification Errors on Social Indicators Research
Authors: Sergey Smetanin (https://orcid.org/0000-0001-6373-3410); Mikhail Komarov (https://orcid.org/0000-0001-7075-0016)
Affiliation (both authors): Department of Business Informatics, Graduate School of Business, National Research University Higher School of Economics, Moscow, Russia
Online access: https://ieeexplore.ieee.org/document/9706439/
Keywords: Misclassification bias; social indicators; classification; supervised machine learning; computational social science; sentiment analysis
topic Misclassification bias
social indicators
classification
supervised machine learning
computational social science
sentiment analysis
url https://ieeexplore.ieee.org/document/9706439/